Methods and systems for multiple taxonomic classification

ABSTRACT

Described herein are methods of identifying a plurality of polynucleotides, as well as detecting presence, absence, or abundance of a plurality of taxa in a sample. Also provided are systems for performing methods of the disclosure.

CROSS-REFERENCE

This application is a continuation of U.S. patent application Ser. No. 15/724,476, filed on Oct. 4, 2017, which is a continuation of PCT application PCT/US2016/029067, filed on Apr. 22, 2016, which claims the benefit of U.S. Provisional Application No. 62/152,782, filed Apr. 24, 2015, which application is incorporated herein by reference.

BACKGROUND OF THE INVENTION

Metagenomics, the genomic analysis of a population of microorganisms, makes possible the profiling of microbial communities in the environment and the human body at unprecedented depth and breadth. Its rapidly expanding use is revolutionizing our understanding of microbial diversity in natural and man-made environments and is linking microbial community profiles with health and disease. To date, most studies have relied on PCR amplification of microbial marker genes (e.g. bacterial 16S rRNA), for which large, curated databases have been established. More recently, higher throughput and lower cost sequencing technologies have enabled a shift towards enrichment-independent metagenomics. These approaches reduce bias, improve detection of less abundant taxa, and enable discovery of novel pathogens.

While conventional, pathogen-specific nucleic acid amplification tests are highly sensitive and specific, they require a priori knowledge of likely pathogens. The result is increasingly large, yet inherently limited diagnostic panels to enable diagnosis of the most common pathogens. In contrast, enrichment-independent high-throughput sequencing allows for unbiased, hypothesis-free detection and molecular typing of a theoretically unlimited number of common and unusual pathogens. Wide availability of next-generation sequencing instruments, lower reagent costs, and streamlined sample preparation protocols are enabling an increasing number of investigators to perform high-throughput DNA and RNA-seq for metagenomics studies. However, analysis of sequencing data is still forbiddingly difficult and time consuming, requiring bioinformatics skills, computational resources, and microbiological expertise that are not available to many laboratories, especially diagnostic ones.

SUMMARY OF THE INVENTION

In view of the foregoing, more computationally efficient, accurate, and easy-to-use tools for comprehensive diagnostic and metagenomics analyses are needed. The methods and systems described herein address this need, and provide other advantages as well.

In one aspect, the present disclosure provides a method of identifying a plurality of polynucleotides in a sample from a sample source. In some embodiments, the method comprises providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (a) performing with a computer system a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as a measure of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) identifying the sequencing read as corresponding to a particular reference sequence in a database of reference sequences if the sum of k-mer weights for the reference sequence is above a threshold level; and (c) assembling a record database comprising reference sequences identified in step (b), wherein the record database excludes reference sequences to which no sequencing read corresponds.
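Steps (a) through (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the disclosure: the weight used here (the fraction of a k-mer's database-wide occurrences that fall within one reference), the function names, the value of k, and the threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every overlapping k-mer in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_weights(references, k):
    """Map each k-mer to {reference_id: weight}.

    Assumed weight (illustrative only): the share of the k-mer's
    occurrences across the whole database that lie in that reference.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for ref_id, seq in references.items():
        for km in kmers(seq, k):
            counts[km][ref_id] += 1
    weights = {}
    for km, per_ref in counts.items():
        total = sum(per_ref.values())
        weights[km] = {ref_id: c / total for ref_id, c in per_ref.items()}
    return weights


def classify_reads(reads, references, k=4, threshold=2.0):
    """Steps (a)-(c): weigh k-mers, threshold the sum, build the record database."""
    weights = build_weights(references, k)
    record = {}  # only references with at least one assigned read appear here
    for read in reads:
        sums = defaultdict(float)
        for km in kmers(read, k):                       # step (a)
            for ref_id, w in weights.get(km, {}).items():
                sums[ref_id] += w
        if not sums:
            continue
        best = max(sums, key=sums.get)
        if sums[best] > threshold:                      # step (b)
            record.setdefault(best, []).append(read)    # step (c)
    return record
```

Because the record is populated only when a read passes the threshold, a reference to which no read corresponds never enters it, which mirrors the exclusion stated in step (c).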

In another aspect, the present disclosure provides a method of identifying one or more taxa in a sample from a sample source, the method comprising: (a) providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (i) performing with a computer system a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as a measure of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; and (ii) calculating a probability that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights, thereby generating a sequence probability; (b) calculating a score for the presence or absence of one or more taxa based on the sequence probabilities corresponding to sequences representative of said one or more taxa; and (c) identifying the one or more taxa as present or absent in the sample based on the corresponding scores. In some embodiments, the one or more taxa comprise a first bacterial strain identified as present and a second bacterial strain identified as absent based on one or more nucleotide differences in sequence. In some embodiments, the first bacterial strain is identified as present and the second bacterial strain is identified as absent based on a single nucleotide difference in sequence. In some embodiments, the method further comprises identifying a condition of the sample source by comparison of the results of step (c) to a biosignature.
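The progression from k-mer weights to sequence probabilities to taxon scores can be illustrated as follows. The sketch assumes, purely for the example, that a read's sequence probability is its per-reference k-mer weight sum normalized over all references it hits, that a taxon's score is the sum of those probabilities over all reads, and that a fixed cutoff separates present from absent. None of these choices is stated in the disclosure, and all names are hypothetical.

```python
def sequence_probabilities(weight_sums):
    """Normalize one read's per-reference k-mer weight sums to probabilities."""
    total = sum(weight_sums.values())
    if total == 0:
        return {}
    return {ref: s / total for ref, s in weight_sums.items()}


def taxon_scores(per_read_sums, ref_to_taxon):
    """Sum sequence probabilities over reads for each taxon (assumed score)."""
    scores = {}
    for sums in per_read_sums:
        for ref, p in sequence_probabilities(sums).items():
            taxon = ref_to_taxon[ref]
            scores[taxon] = scores.get(taxon, 0.0) + p
    return scores


def call_taxa(scores, taxa, cutoff=1.0):
    """Mark each taxon present (True) or absent (False) against the cutoff."""
    return {t: scores.get(t, 0.0) >= cutoff for t in taxa}
```

With this layout, two strains that differ at a single nucleotide would be distinguished only by reads whose k-mers span that position, since all other k-mers contribute equally to both.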

In some embodiments of any of the various aspects of the disclosure, each reference sequence in the database of reference sequences is associated with, prior to the comparison, a reference k-mer weight as a measure of how likely it is that a k-mer within the reference sequence originates from the reference sequence. In some embodiments, the database of reference sequences comprises sequences from a plurality of taxa, and each reference sequence in the database of reference sequences is associated with a reference k-mer weight as a measure of how likely it is that a k-mer within the reference sequence originates from a taxon within the plurality of taxa. One or more of the steps may be performed for all sequencing reads in parallel, such as the step of performing the sequence comparison. The method may further comprise quantifying an amount of polynucleotides corresponding to the reference sequences identified in step (b) based on a number of corresponding sequencing reads. In some embodiments, the method further comprises determining presence, absence, or abundance of a plurality of taxa in the sample based on results of step (b), wherein the plurality of reference polynucleotide sequences comprise groups of sequences corresponding to individual taxa in the plurality of taxa. A sequencing read identified as belonging to a particular taxon in the plurality of taxa and not present among the group of sequences corresponding to that taxon can be added to the group of sequences corresponding to that taxon for use in later sequence comparisons. In some embodiments, determining the presence, absence, or abundance of a taxon in the plurality of taxa comprises resolving a tie between two possible taxa to which a sequencing read corresponds, wherein resolving the tie comprises determining a sum of k-mer weights for the reference sequence along each branch of a phylogenetic tree. In some cases, a particular individual is identified as the sample source.
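The tie-resolution step can be pictured as summing k-mer weights along each root-to-taxon branch and keeping the branch with the larger total. The sketch below assumes a tree stored as a child-to-parent mapping and per-node weights already accumulated for one read. Both data structures and all names are hypothetical and are not taken from the disclosure.

```python
def path_to_root(taxon, parent):
    """Return the nodes from taxon up to the root, inclusive."""
    path = [taxon]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def resolve_tie(candidates, node_weights, parent):
    """Pick the candidate whose branch carries the largest k-mer weight sum."""
    def branch_sum(taxon):
        return sum(node_weights.get(n, 0.0) for n in path_to_root(taxon, parent))
    return max(candidates, key=branch_sum)
```

In this toy form, two species with equal weights of their own are separated by the weight their ancestors contribute, so support at the genus level breaks the tie.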

The database of reference sequences can comprise any of a variety of reference sequences. In some embodiments, the reference sequences are from one or more of bacteria, archaea, chromalveolata, viruses, fungi, plants, fish, amphibians, reptiles, birds, mammals, and humans. In some cases, the database of reference sequences consists of sequences from a reference individual or a reference sample source. In this case, the method may further comprise identifying the polynucleotides from the sample source as being derived from the reference individual or the reference sample source. In some embodiments, the database of reference sequences comprises k-mers having one or more mutations with respect to known polynucleotide sequences, such that a plurality of variants of the known polynucleotide sequences are represented in the database of reference sequences. The database of reference sequences can comprise marker gene sequences for taxonomic classification of bacterial sequences, such as 16S rRNA sequences. In some embodiments, the database of reference sequences comprises sequences of human transcripts.
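One simple way to represent variants of known sequences in a k-mer database is to enumerate mutated k-mers ahead of time. The following sketch generates every single-substitution variant of a DNA k-mer. Restricting the example to single substitutions over the A, C, G, T alphabet is an assumption made for illustration, and the function name is hypothetical.

```python
def single_substitution_variants(kmer, alphabet="ACGT"):
    """Return all k-mers that differ from kmer at exactly one position."""
    variants = set()
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                variants.add(kmer[:i] + alt + kmer[i + 1:])
    return variants
```

A k-mer of length k yields 3k such variants, so storing them trades database size for tolerance of point differences between reads and known references.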

In some embodiments, the database of reference sequences consists of sequences associated with a condition. One or more such sequences may form a biosignature for the condition, a plurality of which may together form the reference database. In some cases, the record database is associated with a condition of the sample source to establish a biosignature for the condition. When sequences are associated with a condition, the method may further comprise identifying a condition of the sample source by comparison of the record database to a biosignature, including identifying the sample source as having the condition. The condition may be contamination, such as food contamination, surface contamination, or environmental contamination. In some embodiments, the condition is infection.

Biosignatures (e.g. of infection) can comprise (i) sequences of host transcripts or levels of sequences of host transcripts; and/or (ii) sequences of one or more infectious agents. In some embodiments, the infection is influenza and the biosignature consists of sequences of one or more of IFIT1, IFI6, IFIT2, ISG15, OASL, IFIT3, NT5C3A, MX2, IFITM1, CXCL10, IFI44L, MX1, IFIH1, OAS2, SAMD9, RSAD2, and DDX58. The sample source can be any of a variety of sample sources. In some cases, the sample source is a subject. Where sequences are associated with a condition, the method may further comprise monitoring treatment in an infected subject by identifying the presence or absence of the biosignature in samples from the infected subject at multiple times after beginning treatment. Treatment of the infected subject can be adjusted based on results of the monitoring.

In some embodiments, methods of the present disclosure comprise selecting, and optionally taking, medical action based on the results of sequence and/or taxa identification. For example, medical action can comprise administering a pharmaceutical composition, such as an antibiotic. In some embodiments, the antibiotic is selected based on efficacy against taxa identified in the sample.

In some embodiments, the database of reference sequences comprises polynucleotide sequences reverse-translated from amino acid sequences. Reverse-translating can use a non-degenerate code comprising a single codon for each amino acid. Where a non-degenerate code is used, a sequencing read can be translated to an amino acid sequence and then reverse-translated using the non-degenerate code prior to comparison with the reverse-translated reference sequences.
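The effect of a non-degenerate code is that synonymous codons collapse to one spelling, so reads and references that encode the same protein share identical k-mers. The sketch below uses the standard genetic code and, as an arbitrary assumption, picks the first codon encountered for each amino acid. It handles a single forward reading frame and uppercase A, C, G, T only, and the disclosure does not specify which codon is chosen. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Non-degenerate code (assumed choice): the first codon listed per amino acid.
AA_TO_CODON = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODON.setdefault(aa, codon)


def translate(dna):
    """Translate frame 0 of dna, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


def reverse_translate(protein):
    """Spell a protein with exactly one codon per amino acid."""
    return "".join(AA_TO_CODON[aa] for aa in protein)


def normalize_read(dna):
    """Translate, then reverse-translate, so synonymous codons become identical."""
    return reverse_translate(translate(dna))
```

After normalization, a read and a reference that differ only at synonymous positions compare as exact nucleotide matches.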

In some embodiments, the k-mer weight relates a count of a particular k-mer within a particular reference sequence, a count of the particular k-mer among a group of sequences comprising the reference sequence, and a count of the particular k-mer among all reference sequences in the database of reference sequences. In some embodiments, step (b) is completed for 20,000 sequencing reads in less than 1.5 seconds. The 20,000 sequencing reads can comprise sequences from two or more of bacteria, viruses, fungi, and humans. In some embodiments, steps (a)-(c) are performed by a computer system in response to a user request. In some embodiments, the user uploads the sequencing reads to the computer system, and the method is performed concurrently with the upload. In some embodiments, the user uploads a plurality of sequencing reads to the computer system, and results of the sequence analysis are reported to the user for one or more of the plurality of sequencing reads while other sequencing reads of the plurality of sequencing reads are uploading. For example, a sequencing file containing a plurality of sequencing reads may be broken into smaller components (e.g. subsets of one or more sequencing reads), and components uploaded first may be analyzed and reported while the remainder of the file continues to upload. The computer system may be remote with respect to the user. The method can further comprise sequencing the plurality of polynucleotides from the sample to generate the sequencing reads.
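The three counts named above can be gathered directly from a grouped reference database. The sketch below computes them and combines them with a purely illustrative formula, the product of the k-mer's reference share and its group share of the database-wide count. The passage does not give the actual relation, so this formula, the nested-dictionary layout, and all names are assumptions. Reference identifiers are assumed to be unique across groups.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_weight(kmer, ref_id, groups, k):
    """Relate reference, group, and database counts of one k-mer.

    groups has the form {group_id: {ref_id: sequence}}.
    Illustrative formula only: (c_ref / c_db) * (c_group / c_db).
    """
    c_ref = c_group = c_db = 0
    for members in groups.values():
        in_group = ref_id in members
        for rid, seq in members.items():
            n = kmer_counts(seq, k)[kmer]
            c_db += n                 # count among all reference sequences
            if in_group:
                c_group += n          # count within the reference's group
            if rid == ref_id:
                c_ref += n            # count within the reference itself
    if c_db == 0:
        return 0.0
    return (c_ref / c_db) * (c_group / c_db)
```

Under this assumed formula, a k-mer found only in one reference scores 1.0, and the score falls as the k-mer appears in other references or other groups.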

In one aspect, the present disclosure provides a method of detecting a plurality of taxa in a sample. In some embodiments, the method comprises providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (a) assigning the sequencing read to a first taxonomic group based on a first sequence comparison between the sequencing read and a first plurality of polynucleotide sequences from different first taxonomic groups, wherein at least two sequencing reads are assigned to different taxonomic groups; (b) performing with a computer system a second sequence comparison between the sequencing read and a second plurality of polynucleotide sequences corresponding to members of the first taxonomic group, wherein the comparison comprises counting a number of k-mers within the sequencing read of at least 5 nucleotides in length that exactly match one or more k-mers within a reference sequence in the second plurality of polynucleotide sequences; (c) classifying the sequencing read as belonging to a second taxonomic group that is more specific than the first taxonomic group if a measure of similarity between the sequencing read and reference sequence is above a first threshold level; (d) if no similarity above the first threshold level is identified in (c), classifying the sequencing read as belonging to the second taxonomic group based on similarity above a second threshold level determined by comparing with the computer system a sequence derived from translating the sequencing read and a third set of reference sequences corresponding to amino acid sequences of members of the first taxonomic group; and (e) identifying the presence, absence, or abundance of the plurality of taxa in the sample based on the classifying of the sequencing reads. Step (b) may further comprise calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence in the second plurality of polynucleotide sequences. In some embodiments, the third set of reference sequences consists of polynucleotide sequences derived from reverse-translating the corresponding amino acid sequences. The method can further comprise performing with the computer system a relaxed sequence comparison between the sequencing read and the second plurality of polynucleotide sequences if the similarity in (d) is below the second threshold, wherein the relaxed sequence comparison is less stringent than the second sequence comparison. In some embodiments, classifying the sequencing read in step (c) comprises resolving a tie between two or more possible taxonomic groups based on a k-mer weight as a measure of how likely it is that the sequencing read corresponds to a polynucleotide from an ancestor of one of the possible taxonomic groups. In some embodiments, step (a) comprises assigning sequencing reads to two or more taxa selected from bacteria, viruses, fungi, or humans. In some embodiments, a sequencing read classified as belonging to the second taxonomic group and not present among the group of sequences corresponding to the second taxonomic group is added to the group of sequences corresponding to the second taxonomic group for use in later sequence comparisons. The second plurality of nucleotide sequences may comprise marker gene sequences for taxonomic classification of bacterial sequences, such as 16S rRNA sequences. The second plurality of nucleotide sequences may comprise sequences of human transcripts.
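The fallback from nucleotide-level to protein-level comparison in steps (b) through (d) can be sketched as a two-tier classifier. The example assumes the read has already been binned to a first taxonomic group in step (a), uses the fraction of exactly matching k-mers as the similarity measure, and translates only the forward frame 0. The k values, thresholds, similarity measure, and all names are hypothetical choices for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
CODON_TO_AA = dict(zip(
    ("".join(c) for c in product("TCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))


def match_fraction(query, reference, k):
    """Fraction of the query's k-mers that exactly match a reference k-mer."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    n = len(query) - k + 1
    if n <= 0:
        return 0.0
    return sum(query[i:i + k] in ref_kmers for i in range(n)) / n


def translate(dna):
    """Translate frame 0 of dna, ignoring any trailing partial codon."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def classify(read, nt_refs, aa_refs, k_nt=5, k_aa=3, t1=0.8, t2=0.5):
    """Return (taxon, tier), or (None, None) if neither tier passes."""
    # Steps (b)-(c): exact nucleotide k-mer matching with k of at least 5.
    best = max(nt_refs, key=lambda t: match_fraction(read, nt_refs[t], k_nt))
    if match_fraction(read, nt_refs[best], k_nt) > t1:
        return best, "nucleotide"
    # Step (d): compare the translated read with amino acid references.
    protein = translate(read)
    best = max(aa_refs, key=lambda t: match_fraction(protein, aa_refs[t], k_aa))
    if match_fraction(protein, aa_refs[best], k_aa) > t2:
        return best, "protein"
    return None, None
```

A read that differs from its reference only at synonymous codon positions fails the nucleotide tier but is recovered by the protein tier, which is the situation step (d) is designed to handle.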

In some embodiments, the method further comprises diagnosing a condition based on a degree of similarity between the plurality of taxa detected in the sample and a biological signature for the condition. The condition can be contamination of the sample, or infection of a subject. When the condition is infection of a subject, the infection can be assessed based on the presence or amount of (i) sequences of host transcripts; and/or (ii) sequences of one or more infectious agents. The method can further comprise monitoring treatment in an infected subject by detecting presence, absence, or abundance of a plurality of taxa in samples from the infected subject at multiple times after beginning treatment, and optionally changing treatment of the infected subject based on results of the monitoring. The method may further comprise classifying the sequencing read as corresponding to a gene transcript if the measure of similarity between the sequencing read and reference sequence is above the first threshold level. Where a sequencing read is classified as corresponding to a gene transcript, the method may further comprise diagnosing a condition based on a degree of similarity between the plurality of taxa detected in the sample and a biological signature for the condition.

In one aspect, the disclosure provides systems for performing any of the methods described herein. In some embodiments, the system is configured for identifying a plurality of polynucleotides in a sample from a sample source based on sequencing reads for the plurality of polynucleotides. For example, the system may comprise one or more computer processors programmed to, for each sequencing read: (a) perform a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) identify the sequencing read as corresponding to a particular reference sequence in a database of reference sequences if the sum of k-mer weights for the reference sequence is above a threshold level; and (c) assemble a record database comprising reference sequences identified in step (b), wherein the record database excludes reference sequences to which no sequencing read corresponds. The system may further comprise a reaction module in communication with the computer processor, wherein the reaction module performs polynucleotide sequencing reactions to produce the sequencing reads.

In some embodiments, the system is configured for identifying one or more taxa in a sample from a sample source based on sequencing reads for a plurality of polynucleotides. For example, the system may comprise one or more computer processors programmed to: (a) for each sequencing read, perform a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) for each sequencing read, calculate a probability that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights, thereby generating a sequence probability; (c) calculate a score for the presence or absence of one or more taxa based on the sequence probabilities corresponding to sequences representative of said one or more taxa; and (d) identify the one or more taxa as present or absent in the sample based on the corresponding scores. The system may further comprise a reaction module in communication with the computer processor, wherein the reaction module performs polynucleotide sequencing reactions to produce the sequencing reads.

In one aspect, the disclosure provides a computer-readable medium comprising code that, upon execution by one or more processors, implements a method according to any of the methods disclosed herein. In some embodiments, execution of the computer readable medium implements a method of identifying a plurality of polynucleotides in a sample from a sample source based on sequencing reads for the plurality of polynucleotides. In one embodiment, the execution of the computer readable medium implements a method comprising: (a) for each of the sequencing reads, performing a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) for each of the sequencing reads, identifying the sequencing read as corresponding to a particular reference sequence in a database of reference sequences if the sum of k-mer weights for the reference sequence is above a threshold level; and (c) assembling a record database comprising reference sequences identified in step (b), wherein the record database excludes reference sequences to which no sequencing read corresponds.

In some embodiments, the execution of the computer readable medium implements a method of identifying one or more taxa in a sample from a sample source based on sequencing reads for a plurality of polynucleotides, the method comprising: (a) for each of the sequencing reads, performing a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as a measure of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) for each of the sequencing reads, calculating a probability that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights, thereby generating a sequence probability; (c) calculating a score for the presence or absence of one or more taxa based on the sequence probabilities corresponding to sequences representative of said one or more taxa; and (d) identifying the one or more taxa as present or absent in the sample based on the corresponding scores.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

FIGS. 1A and 1B provide an overview of the structure and user interface of a system in accordance with an embodiment of the disclosure, referred to as Taxonomer.

FIGS. 2A, 2B, 2C, 2D, and 2E show performance of an embodiment of Taxonomer's ‘Classifier’ module for bacterial and fungal classification, as well as bacterial community profiling. Numbers below the bars indicate the reads (%) represented by the bottom bar at the corresponding position. Numbers above the bars indicate the reads (%) of the first bar above the bottom bar at the corresponding position.

FIGS. 3A, 3B, 3C, 3D, 3E, and 3F show performance characteristics of an embodiment of Taxonomer's ‘Protonomer’ module for virus detection.

FIGS. 4A, 4B, 4C, 4D, 4E, 4F, 4G, and 4H show performance characteristics of an embodiment of Taxonomer's ‘Classifier’ module for host transcript expression profiling.

FIGS. 5A, 5B, 5C, and 5D illustrate results for detecting previously unrecognized infections or laboratory contamination and compatibility with commonly used sequencers.

FIGS. 6A, 6B, 6C, and 6D show performance characteristics of an embodiment of Taxonomer's ‘Binner’ module.

FIG. 7 illustrates results demonstrating an increase in accuracy achieved by calculating the number of k-mers shared between a sequencing read and each of the binning databases prior to read assignment (parallel approach).

FIGS. 8A, 8B, and 8C illustrate results for performance and sensitivity of an embodiment of Taxonomer's Protonomer module, RAPSearch2, and Diamond in placing viral reads in the correct taxonomic bin.

FIGS. 9A, 9B, and 9C show relative performances and sensitivities of embodiments of Taxonomer's Protonomer, Afterburner, and the combination of Protonomer/Afterburner.

FIG. 10 shows the effect of different confidence cut-offs of Kraken compared to an embodiment of Taxonomer and SURPI.

FIG. 11 illustrates results showing that query sequences not represented in the reference database cause false-positive and false-negative classifications, and that an embodiment of Taxonomer is less affected by this than other tools.

FIGS. 12A and 12B show the read-level (top) and taxon-level (bottom) bacterial classification accuracy of BLAST, the RDP Classifier, Kraken, and an embodiment of Taxonomer.

FIG. 13 illustrates the impact of sequencing error rates on different classification methods.

FIGS. 14A, 14B, 14C, and 14D illustrate results showing that an embodiment of Taxonomer classifies bacterial 16S rRNA reads at >200-fold increased speed compared to the RDP Classifier while providing highly comparable bacterial community profiles.

FIG. 15 shows example analysis times for the RDP Classifier (R), an embodiment of Taxonomer (T), and Kraken (K) for classification of samples shown in FIGS. 14A-D.

FIGS. 16A and 16B illustrate results showing that an embodiment of Taxonomer was able to correctly identify Elizabethkingia meningoseptica in sample SAMN03015718 (SRR1564828) and Enterovirus A in plasma from a patient with suspected Ebola virus disease in Sierra Leone (SRR1564825).

FIG. 17 shows the phylogenetic tree of the consensus sequence of a novel Anellovirus with reference sequences for Torque teno mini viruses, as determined by an embodiment of Taxonomer.

FIG. 18 shows example processing times of an embodiment of Taxonomer compared to the classification pipelines SURPI and Kraken.

FIG. 19 provides example reference databases in accordance with embodiments of the disclosure.

FIG. 20 provides results of sequence comparisons performed in accordance with embodiments of the disclosure.

FIGS. 21A, 21B, and 21C illustrate results of an example sequence analysis for microbial strain profiling in accordance with embodiments of the disclosure.

FIGS. 22A, 22B, and 22C illustrate results of an example sequence analysis for microbial strain profiling in accordance with embodiments of the disclosure. The y-axis presents the fraction of correctly-typed strains.

DETAILED DESCRIPTION OF THE INVENTION

Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

The systems and methods of this disclosure as described herein may employ, unless otherwise indicated, conventional techniques and descriptions of molecular biology (including recombinant techniques), cell biology, biochemistry, microarray and sequencing technology, which are within the skill of those who practice in the art. Such conventional techniques include polymer array synthesis, hybridization and ligation of oligonucleotides, sequencing of oligonucleotides, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the examples herein. However, equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Green, et al., Eds., Genome Analysis: A Laboratory Manual Series (Vols. I-IV) (1999); Weiner, et al., Eds., Genetic Variation: A Laboratory Manual (2007); Dieffenbach, Dveksler, Eds., PCR Primer: A Laboratory Manual (2003); Bowtell and Sambrook, DNA Microarrays: A Molecular Cloning Manual (2003); Mount, Bioinformatics: Sequence and Genome Analysis (2004); Sambrook and Russell, Condensed Protocols from Molecular Cloning: A Laboratory Manual (2006); and Sambrook and Russell, Molecular Cloning: A Laboratory Manual (2002) (all from Cold Spring Harbor Laboratory Press); Stryer, L., Biochemistry (4th Ed.) W.H. Freeman, N.Y. (1995); Gait, “Oligonucleotide Synthesis: A Practical Approach” IRL Press, London (1984); Nelson and Cox, Lehninger, Principles of Biochemistry, 3rd Ed., W.H. Freeman Pub., New York (2000); and Berg et al., Biochemistry, 5th Ed., W.H. Freeman Pub., New York (2002), all of which are herein incorporated by reference in their entirety for all purposes. Before the present compositions, research tools and systems and methods are described, it is to be understood that this disclosure is not limited to the specific systems and methods, compositions, targets and uses described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular aspects only and is not intended to limit the scope of the present disclosure, which will be limited only by appended claims.

The term “about” or “approximately” means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, “about” can mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, preferably within 5-fold, and more preferably within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.

The terms “polynucleotide”, “nucleotide”, “nucleotide sequence”, “nucleic acid” and “oligonucleotide” are used interchangeably. They refer to a polymeric form of nucleotides of any length, either deoxyribonucleotides or ribonucleotides, or analogs thereof. Polynucleotides may have any three dimensional structure, and may perform any function, known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), short interfering RNA (siRNA), short-hairpin RNA (shRNA), micro-RNA (miRNA), ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers. A polynucleotide may comprise one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs. If present, modifications to the nucleotide structure may be imparted before or after assembly of the polymer. The sequence of nucleotides may be interrupted by non-nucleotide components. A polynucleotide may be further modified after polymerization, such as by conjugation with a labeling component.

“Complementarity” refers to the ability of a nucleic acid to form hydrogen bond(s) with another nucleic acid sequence by either traditional Watson-Crick base pairing or other non-traditional types. A percent complementarity indicates the percentage of residues in a nucleic acid molecule which can form hydrogen bonds (e.g., Watson-Crick base pairing) with a second nucleic acid sequence (e.g., 5, 6, 7, 8, 9, 10 out of 10 being 50%, 60%, 70%, 80%, 90%, and 100% complementary, respectively). “Perfectly complementary” means that all the contiguous residues of a nucleic acid sequence will hydrogen bond with the same number of contiguous residues in a second nucleic acid sequence. “Substantially complementary” as used herein refers to a degree of complementarity that is at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98%, 99%, or 100% over a region of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, or more nucleotides, or refers to two nucleic acids that hybridize under stringent conditions. Sequence identity, such as for the purpose of assessing percent complementarity, may be measured by any suitable alignment algorithm, including but not limited to the Needleman-Wunsch algorithm (see e.g. the EMBOSS Needle aligner available at www.ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html, optionally with default settings), the BLAST algorithm (see e.g. the BLAST alignment tool available at blast.ncbi.nlm.nih.gov/Blast.cgi, optionally with default settings), or the Smith-Waterman algorithm (see e.g. the EMBOSS Water aligner available at www.ebi.ac.uk/Tools/psa/emboss_water/nucleotide.html, optionally with default settings). Optimal alignment may be assessed using any suitable parameters of a chosen algorithm, including default parameters.
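The percent complementarity arithmetic described above can be illustrated with a minimal sketch. It assumes DNA only, Watson-Crick pairing only, two ungapped sequences of equal length written 5' to 3', and antiparallel pairing; the function name `percent_complementarity` is hypothetical and is not part of the disclosure.

```python
# Watson-Crick partners for DNA (assumption: no non-traditional pairing).
WATSON_CRICK = {"A": "T", "T": "A", "G": "C", "C": "G"}


def percent_complementarity(seq_a, seq_b):
    """Percentage of residues in seq_a that can pair with seq_b.

    Both sequences are written 5'->3', so seq_b is read in reverse
    to model antiparallel hybridization.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    paired = sum(
        1 for a, b in zip(seq_a, reversed(seq_b)) if WATSON_CRICK.get(a) == b
    )
    return 100.0 * paired / len(seq_a)
```

For example, under these assumptions `percent_complementarity("AAGC", "GCTT")` evaluates to 100.0, while a single mismatched residue out of four gives 75.0.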

As used herein, “expression” refers to the process by which a polynucleotide is transcribed from a DNA template (such as into an mRNA or other RNA transcript) and/or the process by which a transcribed mRNA is subsequently translated into peptides, polypeptides, or proteins. Transcripts and encoded polypeptides may be collectively referred to as “gene product.” If the polynucleotide is derived from genomic DNA, expression may include splicing of the mRNA in a eukaryotic cell.

“Differentially expressed,” as applied to a nucleotide sequence or polypeptide sequence in a subject, refers to over-expression or under-expression of that sequence when compared to that detected in a control. Under-expression also encompasses absence of expression of a particular sequence as evidenced by the absence of detectable expression in a test subject when compared to a control.

The terms “polypeptide”, “peptide” and “protein” are used interchangeably herein to refer to polymers of amino acids of any length. The polymer may be linear or branched, it may comprise modified amino acids, and it may be interrupted by non-amino acids. The terms also encompass an amino acid polymer that has been modified, for example by disulfide bond formation, glycosylation, lipidation, acetylation, phosphorylation, or any other manipulation, such as conjugation with a labeling component. As used herein the term “amino acid” includes natural and/or unnatural or synthetic amino acids, including glycine and both the D or L optical isomers, and amino acid analogs and peptidomimetics.

A “control” is an alternative subject or sample used in an experiment for comparison purposes.

The terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. Tissues, cells, and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed.

The terms “determining”, “measuring”, “evaluating”, “assessing,” “assaying,” and “analyzing” can be used interchangeably herein to refer to any form of measurement, and include determining if an element is present or not (for example, detection). These terms can include both quantitative and/or qualitative determinations. Assessing may be relative or absolute. “Detecting the presence of” can include determining the amount of something present, as well as determining whether it is present or absent.

The term specificity, or true negative rate, can refer to a test's ability to exclude a condition correctly. For example, in a classification algorithm, the specificity of the algorithm may refer to the proportion of reads known not to be from an organism in a given taxonomic bin, which will not be placed in the taxonomic bin. In some cases, this is calculated by determining the proportion of true negatives (reads not placed in the bin that are not from the taxonomic bin) to the total number of reads that are not derived from an organism within the taxonomic bin (the sum of the reads that are not placed in a given taxonomic bin and are not derived from an organism within that taxonomic bin and reads that are placed in that taxonomic bin that are not derived from an organism within that taxonomic bin).

The term sensitivity, or true positive rate, can refer to a test's ability to identify a condition correctly. For example, in a classification algorithm, the sensitivity of a test may refer to the proportion of reads known to be from an organism in a given taxonomic bin, which will be placed in the taxonomic bin. In some cases, this is calculated by determining the proportion of true positives (reads placed in the bin that are from the taxonomic bin) to the total number of reads that are derived from an organism within the taxonomic bin (the sum of the reads that are placed in a given taxonomic bin and are derived from an organism within that taxonomic bin and reads that are not placed in that taxonomic bin that are derived from an organism within that taxonomic bin).

The quantitative relationship between sensitivity and specificity can change as different classification cut-offs are chosen. This variation can be represented using ROC curves. The x-axis of a ROC curve shows the false-positive rate of an assay, which can be calculated as (1−specificity). The y-axis of a ROC curve reports the sensitivity for an assay. This allows one to determine a sensitivity of an assay for a given specificity, and vice versa.
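As an illustration of the definitions above, the following minimal sketch computes sensitivity, specificity, and the points of a ROC curve for a single taxonomic bin. The function names, the per-read `scores`, and the rule that a read is placed in the bin when its score is at or above the cut-off are assumptions made for the example only.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negative rate: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


def roc_points(scores, labels, cutoffs):
    """Return one (false-positive rate, sensitivity) pair per cut-off.

    scores: per-read score for the bin (hypothetical classifier output).
    labels: True if the read is known to derive from the bin.
    Requires at least one read from the bin and one read not from it.
    """
    points = []
    for cutoff in cutoffs:
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, in_bin in pairs if s >= cutoff and in_bin)
        fn = sum(1 for s, in_bin in pairs if s < cutoff and in_bin)
        fp = sum(1 for s, in_bin in pairs if s >= cutoff and not in_bin)
        tn = sum(1 for s, in_bin in pairs if s < cutoff and not in_bin)
        points.append((1 - specificity(tn, fp), sensitivity(tp, fn)))
    return points
```

Sweeping the cut-off from permissive to strict traces the curve from the upper right corner (every read placed in the bin) to the lower left corner (no read placed in the bin).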

As used herein, the terms “adaptor” and “adapter” are used interchangeably and can refer to an oligonucleotide that may be attached to the end of a nucleic acid. Adaptor sequences may comprise, for example, priming sites, the complement of a priming site, recognition sites for endonucleases, common sequences, and promoters. Adaptors may also incorporate modified nucleotides that modify the properties of the adaptor sequence. For example, phosphorothioate groups may be incorporated in one of the adaptor strands.

The terms “taxon” (plural “taxa”), “taxonomic group,” and “taxonomic unit” are used interchangeably to refer to a group of one or more organisms that comprises a node in a clustering tree. The level of a cluster is determined by its hierarchical order. In one embodiment, a taxon is a group tentatively assumed to be a valid taxon for purposes of phylogenetic analysis. In another embodiment, a taxon is any of the extant taxonomic units under study. In yet another embodiment, a taxon is given a name and a rank. For example, a taxon can represent a domain, a sub-domain, a kingdom, a sub-kingdom, a phylum, a sub-phylum, a class, a sub-class, an order, a sub-order, a family, a subfamily, a genus, a subgenus, or a species. In some embodiments, taxa can represent one or more organisms from the kingdoms eubacteria, protista, or fungi at any level of a hierarchical order.

In general, “sequence identity” refers to an exact nucleotide-to-nucleotide or amino acid-to-amino acid correspondence of two polynucleotides or polypeptide sequences, respectively. Typically, techniques for determining sequence identity include determining the nucleotide sequence of a polynucleotide and/or determining the amino acid sequence encoded thereby, and comparing these sequences to a second nucleotide or amino acid sequence. Two or more sequences (polynucleotide or amino acid) can be compared by determining their “percent identity.” The percent identity of two sequences, whether nucleic acid or amino acid sequences, is the number of exact matches between two aligned sequences divided by the length of the shorter sequence and multiplied by 100. Percent identity may also be determined, for example, by comparing sequence information using the advanced BLAST computer program, including version 2.2.9, available from the National Institutes of Health. The BLAST program is based on the alignment method of Karlin and Altschul, Proc. Natl. Acad. Sci. USA 87:2264-2268 (1990) and as discussed in Altschul, et al., J. Mol. Biol. 215:403-410 (1990); Karlin and Altschul, Proc. Natl. Acad. Sci. USA 90:5873-5877 (1993); and Altschul et al., Nucleic Acids Res. 25:3389-3402 (1997). Briefly, the BLAST program defines identity as the number of identical aligned symbols (i.e., nucleotides or amino acids), divided by the total number of symbols in the shorter of the two sequences. The program may be used to determine percent identity over the entire length of the proteins being compared. Default parameters are provided to optimize searches with short query sequences, for example with the blastp program. The program also allows use of an SEG filter to mask off segments of the query sequences as determined by the SEG program of Wootton and Federhen, Computers and Chemistry 17:149-163 (1993). Ranges of desired degrees of sequence identity are approximately 80% to 100% and integer values therebetween. 
In general, an exact match indicates 100% identity over the length of the shortest of the sequences being compared (or over the length of both sequences, if identical).
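The percent identity calculation described above (exact matches between two aligned sequences, divided by the length of the shorter sequence, multiplied by 100) can be sketched as follows. The sketch assumes the alignment has already been produced by some aligner and is supplied as two equal-length strings with "-" marking gaps; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Exact matches / length of the shorter ungapped sequence * 100."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count aligned columns where both symbols are identical and not gaps.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    shorter = min(
        len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, ""))
    )
    if shorter == 0:
        raise ValueError("sequences must contain at least one symbol")
    return 100.0 * matches / shorter
```

Under this definition a short sequence that matches a longer one along its whole length scores 100%, consistent with the statement that an exact match indicates 100% identity over the length of the shortest sequence.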

In one aspect, the disclosure provides a method of identifying a plurality of polynucleotides in a sample source. In some embodiments, the method comprises providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (a) performing with a computer system a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as a measure of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) identifying the sequencing read as corresponding to a particular reference sequence in a database of reference sequences if the sum of k-mer weights for the reference sequence is above a threshold level; and (c) assembling a record database comprising reference sequences identified in step (b), wherein the record database excludes reference sequences to which no sequencing read corresponds.

In another aspect, the disclosure provides a method of identifying one or more taxa in a sample from a sample source. In some embodiments, the method comprises (a) providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (i) performing with a computer system a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as a measure of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; and (ii) calculating a probability that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights, thereby generating a sequence probability; (b) calculating a score for the presence or absence of one or more taxa based on the sequence probabilities corresponding to sequences representative of said one or more taxa; and (c) identifying the one or more taxa as present or absent in the sample based on the corresponding scores. In some cases, the one or more taxa comprises a first bacterial strain identified as present and a second bacterial strain identified as absent based on one or more nucleotide differences in sequence. In some cases, the first bacterial strain is identified as present and the second bacterial strain is identified as absent based on a single nucleotide difference in sequence.

In general, a sequencing read (also referred to as a “read” or “query sequence”) refers to the inferred sequence of nucleotide bases in a nucleic acid molecule. A sequencing read may be of any appropriate length, such as about or more than about 20 nt, 30 nt, 36 nt, 40 nt, 50 nt, 75 nt, 100 nt, 150 nt, 200 nt, 250 nt, 300 nt, 400 nt, 500 nt, or more in length. In some embodiments, a sequencing read is less than 200 nt, 150 nt, 100 nt, 75 nt, or fewer in length. Sequencing reads can be “paired,” meaning that they are derived from different ends of a nucleic acid fragment. Paired reads can have intervening unknown sequence or overlap. In some cases, the sequencing read is a contig or consensus sequence assembled from separate overlapping reads. A sequencing read may be analyzed in terms of component k-mers. In general, “k-mer” refers to the subsequences of a given length k that make up a sequencing read. For example, the sequence “AGCTCT” can be divided into the 3-nt subsequences “AGC,” “GCT,” “CTC,” and “TCT.” In this example, each of these subsequences is a k-mer, wherein k=3. K-mers may be overlapping or non-overlapping.
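The division of a read into k-mers can be sketched in a few lines. The function name `kmers` and the `overlapping` flag are hypothetical; the sketch simply slides a window of length k along the read, advancing by one position for overlapping k-mers or by k positions for non-overlapping k-mers.

```python
def kmers(sequence, k, overlapping=True):
    """Return the k-mers of a sequence in order of position."""
    step = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


print(kmers("AGCTCT", 3))                     # ['AGC', 'GCT', 'CTC', 'TCT']
print(kmers("AGCTCT", 3, overlapping=False))  # ['AGC', 'TCT']
```

The first call reproduces the "AGCTCT" example given above with k=3.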

Sequence comparison may comprise one or more comparison steps in which one or more k-mers of a sequencing read are compared to k-mers of one or more reference sequences (also referred to simply as a “reference”). In some embodiments, a k-mer is about or more than about 3 nt, 4 nt, 5 nt, 6 nt, 7 nt, 8 nt, 9 nt, 10 nt, 11 nt, 12 nt, 13 nt, 14 nt, 15 nt, 16 nt, 17 nt, 18 nt, 19 nt, 20 nt, 25 nt, 30 nt, 35 nt, 40 nt, 45 nt, 50 nt, 75 nt, 100 nt, or more in length. In some embodiments, a k-mer is about or less than about 30 nt, 25 nt, 20 nt, 15 nt, 10 nt, or fewer in length. The k-mer may be in the range of 3 nt to 13 nt, 5 nt to 25 nt, 7 nt to 99 nt, or 3 nt to 99 nt in length. The length of k-mer analyzed at each step may vary. For example, a first comparison may compare k-mers in a sequencing read and a reference sequence that are 21 nt in length, whereas a second comparison may compare k-mers in a sequencing read and a reference sequence that are 7 nt in length. For any given sequence in a comparison step, k-mers analyzed may be overlapping (such as in a sliding window), and may be of same or different lengths. While k-mers are generally referred to herein as nucleic acid sequences, sequence comparison also encompasses comparison of polypeptide sequences, including comparison of k-mers consisting of amino acids.

A reference sequence includes any sequence to which a sequencing read is compared. Typically, the reference sequence is associated with some known characteristic, such as a condition of a sample source, a taxonomic group, a particular species, an expression profile, a particular gene, an associated phenotype such as likely disease progression, drug resistance or pathogenicity, increased or reduced predisposition to disease, or other characteristic. Typically, a reference sequence is one of many such reference sequences in a database. A variety of databases comprising various types of reference sequences are available, one or more of which may serve as a reference database either individually or in various combinations. Databases can comprise many species and sequence types, such as NR, UniProt, SwissProt, TrEMBL, or UniRef90. Databases can comprise specific kinds of sequences from multiple species, such as those used for taxonomic classification of species, such as bacteria. Such databases can be 16S databases, such as the Greengenes database, the UNITE database, or the SILVA database. Marker genes other than 16S may be used as reference sequences for the identification of microorganisms (e.g. bacteria), such as metabolic genes, genes encoding structural proteins, proteins that control growth, cell cycle or reproductive regulation, housekeeping genes or genes that encode virulence, toxins, or other pathogenic factors. Specific examples of marker genes include, but are not limited to, 18S rDNA, 23S rDNA, gyrA, gyrB gene, groEL, rpoB gene, fusA gene, recA gene, sodA, cox1 gene, and nifD gene. Reference databases can comprise internal transcribed spacer (ITS) databases, such as UNITE, ITSoneDB, or ITS2. 
Databases can comprise multiple sequences from a single species, such as the human genome, the human transcriptome, model organisms such as the mouse genome, the yeast transcriptome, or the C. elegans proteome, or disease vectors such as bats, ticks, or mosquitoes and other domestic and wild animals. In some embodiments, the reference database comprises sequences of human transcripts. Reference sequences in databases can comprise DNA sequences, RNA sequences, or protein sequences. Reference sequences in databases can comprise sequences from a plurality of taxa. In some cases, the reference sequences are from a reference individual or a reference sample source. Examples of reference individual genomes include a maternal genome, a paternal genome, or the genome of a non-cancerous tissue sample. Examples of reference individuals or sample sources are the human genome, the mouse genome, or the genomes of particular serovars, genovars, strains, variants or otherwise characterized types of bacteria, archaea, viruses, phages, fungi, and parasites. The database can comprise polymorphic reference sequences that contain one or more mutations with respect to known polynucleotide sequences. Such polymorphic reference sequences can be different alleles found in the population, such as SNPs, indels, microdeletions, microexpansions, common rearrangements, genetic recombinations, or prophage insertion sites, and may contain information on their relative abundance compared to non-polymorphic sequences. Polymorphic reference sequences may also be artificially generated from the reference sequences of a database, such as by varying one or more (including all) positions in a reference genome such that a plurality of possible mutations not in the actual reference database are represented for comparison. 
The database of reference sequences can comprise reference sequences of one or more of a variety of different taxonomic groups, including but not limited to bacteria, archaea, chromalveolata, viruses, fungi, plants, fish, amphibians, reptiles, birds, mammals, and humans. In some cases, the database of reference sequences consists of sequences from one or more reference individuals or reference sample sources (e.g. 10, 100, 1000, 10000, 100000, 1000000, or more), and each reference sequence in the database is associated with its corresponding individual or sample source. In some embodiments, an unknown sample may be identified as originating from an individual or sample source represented in the reference database on the basis of a sequence comparison.

In some embodiments, each reference sequence in the database of reference sequences is associated with, prior to the comparison, a k-mer weight as a measure of how likely it is that a k-mer within the reference sequence originates from the reference sequence. Alternatively, the database of reference sequences can comprise sequences from a plurality of taxa, and each reference sequence in the database of reference sequences is associated with a k-mer weight as a measure of how likely it is that a k-mer within the reference sequence originates from a taxon within the plurality of taxa. Calculating the k-mer weight can comprise comparing a reference sequence in the database to the other reference sequences in the database, such as by a method described herein. The k-mer values thus associated with sequences or taxa in the database may then be used in determining k-mer weights for k-mers within sequencing reads.

In general, comparing k-mers in a read to a reference sequence comprises counting k-mer matches between the two. The stringency for identifying a match may vary. For example, a match may be an exact match, in which the nucleotide sequence of the k-mer from the read is identical to the nucleotide sequence of the k-mer from the reference. Alternatively, a match may be an incomplete match, where 1, 2, 3, 4, 5, 10, or more mismatches are permitted. In addition to counting matches, a likelihood (also referred to as a “k-mer weight” or “KW”) can be calculated. In some embodiments, the k-mer weight relates a count of a particular k-mer within a particular reference sequence, a count of the particular k-mer among a group of sequences comprising the reference sequence, and a count of the particular k-mer among all reference sequences in the database of reference sequences. In one embodiment, the k-mer weight is calculated as a measure of how likely it is that a particular k-mer (K_(i)) originates from a reference sequence (ref_(i)), according to the following formula:

$\begin{matrix}{KW_{ref_{i}}\left( K_{i} \right) = \dfrac{C_{ref}\left( K_{i} \right)/C_{db}\left( K_{i} \right)}{C_{db}\left( K_{i} \right)/\text{Total k-mer count}}} & \left( \text{Eqn. 1} \right)\end{matrix}$

C represents a function that returns the count of K_(i). C_(ref)(K_(i)) indicates the count of the K_(i) in a particular reference. C_(db)(K_(i)) indicates the count of K_(i) in the database. This weight provides a relative, database specific measure of how likely it is that a k-mer originated from a particular reference. Prior to comparing a sequencing read to the database of reference sequences, the k-mer weight (or measurement of likelihood that a k-mer originates from a given reference sequence) can be calculated for each k-mer and reference sequence in the database. In some cases, when a reference database comprises sequences from a plurality of taxa, each reference sequence can be associated with a measure of likelihood, or k-mer weight, that a k-mer within the reference sequence originates from a taxon within a plurality of taxa. As a non-limiting example, a reference database can comprise sequences from multiple species of canines, and the k-mer weight could be calculated by relating the count of a given k-mer in all canine sequences to its count in the entire database, which includes other taxa. In some examples, the k-mer weight measuring how likely it is that a k-mer originates from a specific taxon is calculated by defining C_(ref)(K_(i)) in the above equation as a function that returns the total count of K_(i) in a particular taxon.
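A minimal sketch of Eqn. 1 follows, assuming exact, overlapping k-mers and assuming that “Total k-mer count” means the total number of k-mer occurrences across all reference sequences in the database (the text does not define it further). The names `count_kmers` and `build_kmer_weights` are hypothetical, and the two toy references are illustrative only.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers in one sequence: C_ref(K_i)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_kmer_weights(references, k):
    """Precompute KW_ref(K_i) for every k-mer of every reference.

    references: dict mapping a reference name to its sequence.
    Returns: dict mapping reference name -> {k-mer: weight}.
    """
    ref_counts = {name: count_kmers(seq, k) for name, seq in references.items()}
    db_counts = Counter()            # C_db(K_i)
    for counts in ref_counts.values():
        db_counts.update(counts)
    total = sum(db_counts.values())  # assumed meaning of "Total k-mer count"
    return {
        name: {
            kmer: (c_ref / db_counts[kmer]) / (db_counts[kmer] / total)
            for kmer, c_ref in counts.items()
        }
        for name, counts in ref_counts.items()
    }


weights = build_kmer_weights({"ref_a": "AAAC", "ref_b": "AAAG"}, k=3)
print(weights["ref_a"])  # {'AAA': 1.0, 'AAC': 4.0}
```

In this toy database the k-mer "AAA" is shared by both references and receives a weight of 1.0, whereas "AAC" occurs only in `ref_a` and receives a weight of 4.0, so rarer and more reference-specific k-mers contribute more. For a taxon-level weight, the same arithmetic could be applied after pooling the counts of all sequences belonging to a taxon.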

For each reference sequence, reference database derived weights for a plurality of k-mers within a sequencing read may be added and compared to a threshold value. The threshold value can be specific to the collection of reference sequences in the database and may be selected based on a variety of factors, such as average read length, whether a specific sequence or source organism is to be identified as present in the sample, and the like. If the sum of k-mer weights for the reference sequence is above the threshold level, the sequencing read may be identified as corresponding to the reference sequence, and optionally the organism or taxonomic group associated with the reference sequence. In some cases, the read is assigned to the reference sequence with the maximum sum of k-mer weights, which may or may not be required to be above a threshold. In the case of a tie, where a sequence read has an equal likelihood of belonging to more than one reference sequence as measured by k-mer weight, the sequence read can be assigned to the taxonomic lowest common ancestor (LCA) taking into account the read's total k-mer weight along each branch of the phylogenetic tree. In general, correspondence with a reference sequence, organism, or taxonomic group indicates that it was present in the sample.
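The summing and thresholding step can be sketched as below. The sketch assumes that per-reference k-mer weights have already been computed and are supplied as a dictionary, that matching is exact, and that the read is assigned to the reference with the maximum sum only if that sum exceeds the threshold. The function name `classify_read`, the toy weights, and the threshold values are hypothetical, and tie handling is deliberately omitted.

```python
def classify_read(read, weights, k, threshold):
    """Sum the weights of a read's k-mers for each reference.

    weights: dict mapping reference name -> {k-mer: weight}.
    Returns (assigned reference or None, per-reference sums).
    """
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    sums = {
        name: sum(kmer_weights.get(kmer, 0.0) for kmer in read_kmers)
        for name, kmer_weights in weights.items()
    }
    if not sums:
        return None, sums
    best = max(sums, key=sums.get)  # ties: first reference wins (simplification)
    assigned = best if sums[best] > threshold else None
    return assigned, sums


toy_weights = {
    "ref_a": {"AAA": 1.0, "AAC": 4.0},
    "ref_b": {"AAA": 1.0, "AAG": 4.0},
}
print(classify_read("AAAC", toy_weights, k=3, threshold=2.0))
# ('ref_a', {'ref_a': 5.0, 'ref_b': 1.0})
```

Raising the threshold above the best sum leaves the read unassigned, which corresponds to a read for which no reference sequence is identified.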

In some aspects, the methods comprise calculating a probability. In some cases, a probability is calculated for a sequencing read generated from a plurality of polynucleotides. In some cases, the probability is the probability (or likelihood) that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights. A probability may be calculated for each sequencing read, thereby generating a plurality of sequence probabilities. In some cases, the presence or absence of one or more taxa in a sample may be determined based on the sequence probabilities. For example, the probability may identify a first bacterial strain as being present in the sample and a second bacterial strain as being absent in the sample. In some cases, the probability is represented as a percentage (%) or as a fraction. In some cases, a probability is provided as a score representative of the probability. The score can be based on any arbitrary scale so long as the score is indicative of the probability (e.g. a probability that an individual sequence corresponds to a particular reference sequence, or a probability that a particular taxon is present in the sample). The probability or a score representative of the probability may be used to determine the presence or absence of one or more taxa within a sample. For example, a probability or score above a threshold value may be indicative of presence, and/or a probability or score below a threshold value may be indicative of absence. In some embodiments, presence or absence is reported as a probability, rather than an absolute call. Example methods for calculating such probabilities are provided herein. In general, embodiments described herein in terms of presence or absence likewise encompass calculating a probability or score for such presence or absence.

Results of methods described herein will typically be assembled in a record database. In some embodiments, the record database comprises reference sequences identified as present in the sample and excludes reference sequences to which no sequencing read was found to correspond, such as by failure to match a sequencing read above a set threshold level. The software routines used to generate the sequence record database and to compare sequencing reads to the database can be run on a computer. The comparison can be performed automatically upon receiving data. The comparison can be performed in response to a user request. The user request can specify which reference database to compare the sample to. The computer can comprise one or more processors. Processors may be associated with one or more controllers, calculation units, and/or other units of a computer system, or implemented in firmware as desired. If implemented in software, the routines may be stored in any computer readable memory, such as in RAM, ROM, flash memory, a magnetic disk, a laser disk, or other storage medium. The record database, sequencing reads, or a report summarizing the results of database construction or sequence read comparison may also be stored in any suitable medium, such as in RAM, ROM, flash memory, a magnetic disk, a laser disk, or other storage medium. Likewise, the record database, sequencing reads, or a report summarizing the results of database construction or sequence read comparison may be delivered to a computing device via any known delivery method including, for example, over a communication channel such as a telephone line, the internet, a wireless connection, etc., or via a transportable medium, such as a computer readable disk, flash drive, etc. A database, sequencing reads, or report may be communicated to a user at a local or remote location using any suitable communication medium. 
For example, the communication medium can be a network connection, a wireless connection, or an internet connection. A database or report can be transmitted over such networks or connections (or any other suitable means for transmitting information, including but not limited to mailing a database summary, such as a print-out) for reception and/or for review by a user. The recipient can be but is not limited to the customer, an individual, a health care provider, a health care manager, or an electronic system (e.g. one or more computers, and/or one or more servers). In some embodiments, the database or report generator sends the report to a recipient's device, such as a personal computer, phone, tablet, or other device. The database or report may be viewed online, saved on the recipient's device, or printed. The comparison of communicated sequencing reads to a database can occur after all the reads are uploaded. The comparison of communicated sequencing reads to a database can begin while the sequencing reads are in the process of being uploaded.

One or more steps of a method described herein may be performed in parallel for each of the plurality of sequencing reads. For example, each of the sequencing reads in the plurality may be subjected in parallel to a first sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences (e.g. reference polynucleotide sequences from a plurality of different taxa and/or a plurality of different reference databases). Comparison in parallel differs from certain stepwise comparison processes in that sequencing reads having a purported match in a first reference database are not subtracted from the query set of sequences for subsequent comparison with a second reference database. In such a stepwise process, sequences having a purported match in the first database may be incorrectly identified before being compared against a reference database containing a more accurate match (e.g. the correct sequence). Instead, by running a comparison against a plurality of different reference sequences corresponding to a plurality of different taxa, each sequence can be assigned to an optimal first taxonomic class prior to identifying with greater specificity a sequence or taxon to which a sequencing read corresponds. For example, sequencing reads may be first classified as corresponding to human, bacterial, or fungal sequences before identifying a particular gene, bacterial species, or fungal species to which the sequencing read corresponds. In some instances, this process is referred to as “binning.” Parallel sequence comparison may comprise comparison with sequences from two or more different taxonomic groups, such as 3, 4, 5, 6, or more different taxonomic groups. In some embodiments, the different taxonomic groups may be selected from two or more of the following: bacteria, archaea, chromalveolata, viruses, fungi, plants, fish, amphibians, reptiles, birds, mammals, and humans.
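The contrast between parallel binning and a stepwise, subtractive comparison can be illustrated with a toy sketch. For simplicity it scores a read against each bin by counting shared exact k-mers instead of summing k-mer weights; that scoring rule, the function names, and the two short example sequences are assumptions made for illustration only.

```python
def shared_kmers(read, reference, k):
    """Number of k-mers in the read that also occur in the reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in ref_kmers)


def bin_parallel(reads, bins, k):
    """Score every read against every bin, then keep the best-scoring bin."""
    result = {}
    for read in reads:
        scores = {name: shared_kmers(read, ref, k) for name, ref in bins.items()}
        best = max(scores, key=scores.get)
        result[read] = best if scores[best] > 0 else None
    return result


def bin_stepwise(reads, bins, k):
    """Contrast case: a read with any hit in an earlier bin is removed
    from the query set before later bins are searched."""
    result = {read: None for read in reads}
    remaining = list(reads)
    for name, ref in bins.items():
        unmatched = []
        for read in remaining:
            if shared_kmers(read, ref, k) > 0:
                result[read] = name
            else:
                unmatched.append(read)
        remaining = unmatched
    return result


bins = {"human": "ACGTACGA", "bacterial": "ACGTTTGC"}
print(bin_parallel(["CGTTTG"], bins, k=3))  # {'CGTTTG': 'bacterial'}
print(bin_stepwise(["CGTTTG"], bins, k=3))  # {'CGTTTG': 'human'}
```

In this toy case the read shares one 3-mer with the first bin and four with the second. The stepwise procedure assigns it to the first bin because it never reaches the better match, whereas the parallel comparison assigns it to the second.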

In some embodiments, a method may further comprise quantifying an amount of polynucleotides corresponding to a reference sequence identified in an earlier step. Quantification can be based on a number of corresponding sequencing reads identified. This can include normalizing the count by the total number of reads, the total number of reads associated with sequences, the length of the reference sequence, or a combination thereof. Examples of such normalization include FPKM and RPKM, but may also include other methods that take into account the relative amount of reads in different samples, such as normalizing sequencing reads from samples by the median of ratios of observed counts per sequence. A difference in quantity between samples can indicate a difference between the two samples. The quantitation can be used to identify differences between subjects, such as comparing the taxa present in the microbiota of subjects with different diets, or to observe changes in the same subject over time, such as observing the taxa present in the microbiota of a subject before and after going on a particular diet.
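As one example of the normalization mentioned above, RPKM (reads per kilobase of reference per million reads) divides the read count for a reference by both the reference length in kilobases and the total read count in millions. The sketch below uses the standard RPKM arithmetic; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, reference_length_nt, total_reads):
    """Reads per kilobase of reference per million reads."""
    if reference_length_nt <= 0 or total_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (reference_length_nt * total_reads)


# 500 reads assigned to a 2,000-nt reference out of 1,000,000 reads in total.
print(rpkm(500, 2000, 1_000_000))  # 250.0
```

Because the value is scaled by both reference length and sequencing depth, it can be compared between references of different lengths and between samples sequenced to different depths.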

In some embodiments, a method may comprise determining the presence, absence, or abundance of specific taxa or nucleotide polymorphisms within samples based on results of an earlier step. In this case, the plurality of reference polynucleotide sequences typically comprises groups of sequences corresponding to individual taxa in the plurality of taxa. In some cases, at least 50, 100, 250, 500, 1000, 5000, 10000, 50000, 100000, 250000, 500000, or 1000000 different taxa are identified as absent or present (and optionally abundance, which may be relative) based on sequences analyzed by a method described herein. In some cases, this analysis is performed in parallel. In some embodiments, the methods, compositions, and systems of the present disclosure enable parallel detection of the presence or absence of a taxon in a community of taxa, such as an environmental or clinical sample, when the taxon identified comprises less than one per 10⁹, or one per 10⁶, or 0.05% of the total population of taxa in the source sample. In some cases, detection is based on sequencing reads corresponding to a polynucleotide that is present at less than 0.01% of the total nucleic acid population. The particular polynucleotide may be at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96% or 97% homologous to other nucleic acids in the population. In some cases, the particular polynucleotide is less than 75%, 50%, 40%, 30%, 20%, or 10% homologous to other nucleic acids in the population. Determining the presence, absence, or abundance of specific taxa can comprise identifying an individual subject as the source of a sample. For example, a reference database may comprise a plurality of reference sequences, each of which corresponds to an individual organism (e.g. a human subject), with sequences from a plurality of different subjects represented among the reference sequences. 
Sequencing reads for an unknown sample may then becompared to sequences of the reference database, and based onidentifying the sequencing reads in accordance with a described method,an individual represented in the reference database may be identified asthe sample source of the sequencing reads. In such a case, the referencedatabase may comprise sequences from at least 10², 10³, 10⁴, 10⁵, 10⁶,10⁷, 10⁸, 10⁹, or more individuals.

In some cases, a sequencing read does not have a match to a reference sequence at the level of a particular taxonomic group (e.g. at the species level), or at any taxonomic level. When no match is found, the corresponding sequence may be added to a reference database on the basis of known characteristics. In some cases, when a sequence is identified as belonging to a particular taxon in the plurality of taxa, and is not present among the group of sequences corresponding to that taxon, it is added to the group of sequences corresponding to the taxon for use in later sequence comparisons. For example, if a bacterial genome is identified as belonging to a particular taxon, such as a genus or family, but the genome comprises sequence that is not present in the sequences associated with that taxon, the bacterial genome can be added to the sequence database. Likewise, if the sample is derived from a particular source or condition, the sequencing read may be added to a reference database of sequences associated with that source or condition for use in identifying future samples that share the same source or condition. As a further example, a sequence that does not have a match at a lower level but does have a match at a higher level, as identified according to a method described herein, may be assigned to that higher level while also adding the sequencing read to the plurality of reference sequences that correspond to that taxonomic group. Reference databases so updated may be used in later sequence comparisons.

In determining the presence, absence or abundance of a taxon in a plurality of taxa (or polymorphism among a plurality of polymorphisms), two possible taxa may be tied for the assignment of a particular sequencing read. In such cases, the tie may be resolved. In one example, a tie is resolved by determining a sum of k-mer weights for the reference sequences along each branch of a phylogenetic tree connecting the taxa. The sequencing read may then be assigned to the node connected to the branch with the highest sum of k-mer weights.
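By way of illustration only, one possible reading of the tie-resolution example above may be sketched as follows. The sketch assumes that the phylogenetic tree is available as a child-to-parent mapping, that a k-mer weight has already been accumulated for each node, and that a "branch" is the path from a tied taxon up to the root. These representations, the function names, and the example weights are hypothetical and are not specified by the disclosure.

```python
def path_to_root(taxon, parent):
    """Return the nodes from `taxon` up to the root of the tree."""
    path = []
    node = taxon
    while node is not None:
        path.append(node)
        node = parent.get(node)  # the root maps to None
    return path


def resolve_tie(tied_taxa, parent, kmer_weight):
    """Pick the tied taxon whose branch carries the highest summed k-mer weight."""
    best_taxon, best_score = None, float("-inf")
    for taxon in tied_taxa:
        score = sum(kmer_weight.get(n, 0.0) for n in path_to_root(taxon, parent))
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon, best_score


# Hypothetical tree: two species in different genera under a common root.
parent = {"root": None, "genus_A": "root", "genus_B": "root",
          "species_1": "genus_A", "species_2": "genus_B"}
# Hypothetical weights: the two species are tied at 3.0 on their own nodes.
kmer_weight = {"root": 1.0, "genus_A": 1.5, "genus_B": 0.5,
               "species_1": 3.0, "species_2": 3.0}

winner, score = resolve_tie(["species_1", "species_2"], parent, kmer_weight)
```

In this hypothetical example the two species are tied on their own weights, and the additional weight on the genus_A branch decides the assignment.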

A reference database can consist of sequences (and optionally abundance levels of sequences) associated with one or more conditions. Multiple conditions may be represented by one or more sequences in the reference database, such as 10, 50, 100, 1000, 10000, 100000, 1000000, or more conditions. For example, a reference database may consist of thousands of groups of sequences, each group of sequences being associated with a different bacterial contaminant, such that contamination of a sample by any of the represented bacteria may be detected by sequence comparison according to a method of the disclosure. A condition can be any characteristic of a sample or source from which a sample is derived. For example, the reference database may consist of a set of genes that are associated with contamination by microorganisms, infection of a subject from which the sample is derived, or a host response to pathogens. Other conditions include, but are not limited to, contamination (e.g. environmental contamination, surface contamination, food contamination, air contamination, water contamination, cell culture contamination), stimulus response (e.g. drug responder or non-responder, allergic response, treatment response), infection (e.g. bacterial infection, fungal infection, viral infection), disease state (e.g. presence of disease, worsening of disease, disease recovery), and a healthy state.

Where the reference database consists of sequences associated with infectious disease or contamination, the sequences may be derived from and associated with any of a variety of infectious agents. The infectious agent can be bacterial. Non-limiting examples of bacterial pathogens include Mycobacteria (e.g. M. tuberculosis, M. bovis, M. avium, M. leprae, and M. africanum), rickettsia, mycoplasma, chlamydia, and legionella. Other examples of bacterial infections include, but are not limited to, infections caused by Gram positive bacillus (e.g., Listeria, Bacillus such as Bacillus anthracis, Erysipelothrix species), Gram negative bacillus (e.g., Bartonella, Brucella, Campylobacter, Enterobacter, Escherichia, Francisella, Hemophilus, Klebsiella, Morganella, Proteus, Providencia, Pseudomonas, Salmonella, Serratia, Shigella, Vibrio and Yersinia species), spirochete bacteria (e.g., Borrelia species including Borrelia burgdorferi that causes Lyme disease), anaerobic bacteria (e.g., Actinomyces and Clostridium species), Gram positive and negative coccal bacteria, Enterococcus species, Streptococcus species, Pneumococcus species, Staphylococcus species, and Neisseria species. Specific examples of infectious bacteria include, but are not limited to: Helicobacter pylori, Legionella pneumophila, Mycobacterium tuberculosis, M. avium, M. intracellulare, M. kansasii, M. gordonae, Staphylococcus aureus, Neisseria gonorrhoeae, Neisseria meningitidis, Listeria monocytogenes, Streptococcus pyogenes (Group A Streptococcus), Streptococcus agalactiae (Group B Streptococcus), Streptococcus viridans, Streptococcus faecalis, Streptococcus bovis, Streptococcus pneumoniae, Haemophilus influenzae, Bacillus anthracis, Erysipelothrix rhusiopathiae, Clostridium tetani, Enterobacter aerogenes, Klebsiella pneumoniae, Pasteurella multocida, Fusobacterium nucleatum, Streptobacillus moniliformis, Treponema pallidum, Treponema pertenue, Leptospira, Rickettsia, and Actinomyces israelii, Acinetobacter, Bacillus, Bordetella, Borrelia, Brucella, Campylobacter, Chlamydia, Chlamydophila, Clostridium, Corynebacterium, Enterococcus, Haemophilus, Helicobacter, Mycobacterium, Mycoplasma, Stenotrophomonas, Treponema, Vibrio, Yersinia, Acinetobacter baumannii, Bordetella pertussis, Brucella abortus, Brucella canis, Brucella melitensis, Brucella suis, Campylobacter jejuni, Chlamydia pneumoniae, Chlamydia trachomatis, Chlamydophila psittaci, Clostridium botulinum, Clostridium difficile, Clostridium perfringens, Corynebacterium diphtheriae, Enterobacter sakazakii, Enterobacter agglomerans, Enterobacter cloacae, Enterococcus faecalis, Enterococcus faecium, Escherichia coli, Francisella tularensis, Helicobacter pylori, Legionella pneumophila, Leptospira interrogans, Mycobacterium leprae, Mycobacterium tuberculosis, Mycobacterium ulcerans, Mycoplasma pneumoniae, Pseudomonas aeruginosa, Rickettsia rickettsii, Salmonella typhi, Salmonella typhimurium, Salmonella enterica, Shigella sonnei, Staphylococcus epidermidis, Staphylococcus saprophyticus, Stenotrophomonas maltophilia, Vibrio cholerae, Yersinia pestis, and the like.

Sequences in the reference database may be associated with viral infectious agents. Non-limiting examples of viral pathogens include the herpes virus (e.g., human cytomegalovirus (HCMV), herpes simplex virus 1 (HSV-1), herpes simplex virus 2 (HSV-2), varicella zoster virus (VZV), Epstein-Barr virus), influenza A virus and Hepatitis C virus (HCV) (see Munger et al., Nature Biotechnology (2008) 26: 1179-1186; Syed et al., Trends in Endocrinology and Metabolism (2009) 21:33-40; Sakamoto et al., Nature Chemical Biology (2005) 1:333-337; Yang et al., Hepatology (2008) 48: 1396-1403) or a picornavirus such as Coxsackievirus B3 (CVB3) (see Rassmann et al., Anti-viral Research (2007) 76: 150-158). Other exemplary viruses include, but are not limited to, the hepatitis B virus, HIV, poxvirus, hepadnavirus, retrovirus, and RNA viruses such as flavivirus, togavirus, coronavirus, Hepatitis D virus, orthomyxovirus, paramyxovirus, rhabdovirus, bunyavirus, filovirus, Adenovirus, Human herpesvirus type 8, Human papillomavirus, BK virus, JC virus, Smallpox, Hepatitis B virus, Human bocavirus, Parvovirus B19, Human astrovirus, Norwalk virus, coxsackievirus, hepatitis A virus, poliovirus, rhinovirus, Severe acute respiratory syndrome virus, Hepatitis C virus, yellow fever virus, dengue virus, West Nile virus, Rubella virus, Hepatitis E virus, and Human immunodeficiency virus (HIV). In certain embodiments, the virus is an enveloped virus. Examples include, but are not limited to, viruses that are members of the hepadnavirus family, herpesvirus family, iridovirus family, poxvirus family, flavivirus family, togavirus family, retrovirus family, coronavirus family, filovirus family, rhabdovirus family, bunyavirus family, orthomyxovirus family, paramyxovirus family, and arenavirus family.
Other examples include, but are not limited to, Hepadnavirus hepatitis B virus (HBV), woodchuck hepatitis virus, ground squirrel (Hepadnaviridae) hepatitis virus, duck hepatitis B virus, heron hepatitis B virus, Herpesvirus herpes simplex virus (HSV) types 1 and 2, varicella-zoster virus, cytomegalovirus (CMV), human cytomegalovirus (HCMV), mouse cytomegalovirus (MCMV), guinea pig cytomegalovirus (GPCMV), Epstein-Barr virus (EBV), human herpes virus 6 (HHV variants A and B), human herpes virus 7 (HHV-7), human herpes virus 8 (HHV-8), Kaposi's sarcoma-associated herpes virus (KSHV), B virus Poxvirus vaccinia virus, variola virus, smallpox virus, monkeypox virus, cowpox virus, camelpox virus, ectromelia virus, mousepox virus, rabbitpox viruses, raccoonpox viruses, molluscum contagiosum virus, orf virus, milker's nodes virus, bovine papular stomatitis virus, sheeppox virus, goatpox virus, lumpy skin disease virus, fowlpox virus, canarypox virus, pigeonpox virus, sparrowpox virus, myxoma virus, hare fibroma virus, rabbit fibroma virus, squirrel fibroma viruses, swinepox virus, tanapox virus, Yabapox virus, Flavivirus dengue virus, hepatitis C virus (HCV), GB hepatitis viruses (GBV-A, GBV-B and GBV-C), West Nile virus, yellow fever virus, St. Louis encephalitis virus, Japanese encephalitis virus, Powassan virus, tick-borne encephalitis virus, Kyasanur Forest disease virus, Togavirus, Venezuelan equine encephalitis (VEE) virus, chikungunya virus, Ross River virus, Mayaro virus, Sindbis virus, rubella virus, Retrovirus human immunodeficiency virus (HIV) types 1 and 2, human T cell leukemia virus (HTLV) types 1, 2, and 5, mouse mammary tumor virus (MMTV), Rous sarcoma virus (RSV), lentiviruses, Coronavirus, severe acute respiratory syndrome (SARS) virus, Filovirus Ebola virus, Marburg virus, Metapneumovirus (MPV) such as human metapneumovirus (HMPV), Rhabdovirus rabies virus, vesicular stomatitis virus, Bunyavirus, Crimean-Congo hemorrhagic fever virus, Rift Valley fever virus, La Crosse virus, Hantaan virus, Orthomyxovirus, influenza virus (types A, B, and C), Paramyxovirus, parainfluenza virus (PIV types 1, 2 and 3), respiratory syncytial virus (types A and B), measles virus, mumps virus, Arenavirus, lymphocytic choriomeningitis virus, Junin virus, Machupo virus, Guanarito virus, Lassa virus, Amapari virus, Flexal virus, Ippy virus, Mobala virus, Mopeia virus, Latino virus, Parana virus, Pichinde virus, Punta Toro virus (PTV), Tacaribe virus and Tamiami virus.
In some embodiments, the virus is a non-enveloped virus, examples of which include, but are not limited to, viruses that are members of the parvovirus family, circovirus family, polyoma virus family, papillomavirus family, adenovirus family, iridovirus family, reovirus family, birnavirus family, calicivirus family, and picornavirus family. Specific examples include, but are not limited to, canine parvovirus, parvovirus B19, porcine circovirus type 1 and 2, BFDV (Beak and Feather Disease virus), chicken anaemia virus, Polyomavirus, simian virus 40 (SV40), JC virus, BK virus, Budgerigar fledgling disease virus, human papillomavirus, bovine papillomavirus (BPV) type 1, cotton tail rabbit papillomavirus, human adenovirus (HAdV-A, HAdV-B, HAdV-C, HAdV-D, HAdV-E, and HAdV-F), fowl adenovirus A, bovine adenovirus D, frog adenovirus, Reovirus, human orbivirus, human coltivirus, mammalian orthoreovirus, bluetongue virus, rotavirus A, rotaviruses (groups B to G), Colorado tick fever virus, aquareovirus A, cypovirus 1, Fiji disease virus, rice dwarf virus, rice ragged stunt virus, idnoreovirus 1, mycoreovirus 1, Birnavirus, bursal disease virus, pancreatic necrosis virus, Calicivirus, swine vesicular exanthema virus, rabbit hemorrhagic disease virus, Norwalk virus, Sapporo virus, Picornavirus, human polioviruses (1-3), human coxsackieviruses A1-22, 24 (CA1-22 and CA24, CA23 (echovirus 9)), human coxsackieviruses (B1-6 (CB1-6)), human echoviruses 1-7, 9, 11-27, 29-33, Vilyuisk virus, simian enteroviruses 1-18 (SEV1-18), porcine enteroviruses 1-11 (PEV1-11), bovine enteroviruses 1-2 (BEV1-2), hepatitis A virus, rhinoviruses, hepatoviruses, cardioviruses, aphthoviruses and echoviruses. The virus may be phage. Examples of phages include, but are not limited to T4, T5, λ phage, T7 phage, G4, P1, φ6, Thermoproteus tenax virus 1, M13, MS2, Qβ, φX174, Φ29, PZA, Φ15, BS32, B103, M2Y (M2), Nf, GA-1, FWLBc1, FWLBc2, FWLLm3, B4.
The reference database may comprise sequences for phage that are pathogenic, protective, or both. In some cases, the virus is selected from a member of the Flaviviridae family (e.g., a member of the Flavivirus, Pestivirus, and Hepacivirus genera), which includes the hepatitis C virus, Yellow fever virus; Tick-borne viruses, such as the Gadgets Gully virus, Kadam virus, Kyasanur Forest disease virus, Langat virus, Omsk hemorrhagic fever virus, Powassan virus, Royal Farm virus, Karshi virus, tick-borne encephalitis virus, Neudoerfl virus, Sofjin virus, Louping ill virus and the Negishi virus; seabird tick-borne viruses, such as the Meaban virus, Saumarez Reef virus, and the Tyuleniy virus; mosquito-borne viruses, such as the Aroa virus, dengue virus, Kedougou virus, Cacipacore virus, Koutango virus, Japanese encephalitis virus, Murray Valley encephalitis virus, St. Louis encephalitis virus, Usutu virus, West Nile virus, Yaounde virus, Kokobera virus, Bagaza virus, Ilheus virus, Israel turkey meningoencephalomyelitis virus, Ntaya virus, Tembusu virus, Zika virus, Banzi virus, Bouboui virus, Edge Hill virus, Jugra virus, Saboya virus, Sepik virus, Uganda S virus, Wesselsbron virus, yellow fever virus; and viruses with no known arthropod vector, such as the Entebbe bat virus, Yokose virus, Apoi virus, Cowbone Ridge virus, Jutiapa virus, Modoc virus, Sal Vieja virus, San Perlita virus, Bukalasa bat virus, Carey Island virus, Dakar bat virus, Montana myotis leukoencephalitis virus, Phnom Penh bat virus, Rio Bravo virus, Tamana bat virus, and the Cell fusing agent virus.
In some cases, the virus is selected from a member of the Arenaviridae family, which includes the Ippy virus, Lassa virus (e.g., the Josiah, LP, or GA391 strain), lymphocytic choriomeningitis virus (LCMV), Mobala virus, Mopeia virus, Amapari virus, Flexal virus, Guanarito virus, Junin virus, Latino virus, Machupo virus, Oliveros virus, Parana virus, Pichinde virus, Pirital virus, Sabia virus, Tacaribe virus, Tamiami virus, Whitewater Arroyo virus, Chapare virus, and Lujo virus. In some cases, the virus is selected from a member of the Bunyaviridae family (e.g., a member of the Hantavirus, Nairovirus, Orthobunyavirus, and Phlebovirus genera), which includes the Hantaan virus, Sin Nombre virus, Dugbe virus, Bunyamwera virus, Rift Valley fever virus, La Crosse virus, Punta Toro virus (PTV), California encephalitis virus, and Crimean-Congo hemorrhagic fever (CCHF) virus. In some cases, the virus is selected from a member of the Filoviridae family, which includes the Ebola virus (e.g., the Zaire, Sudan, Ivory Coast, Reston, and Uganda strains) and the Marburg virus (e.g., the Angola, Ci67, Musoke, Popp, Ravn and Lake Victoria strains); a member of the Togaviridae family (e.g., a member of the Alphavirus genus), which includes the Venezuelan equine encephalitis virus (VEE), Eastern equine encephalitis virus (EEE), Western equine encephalitis virus (WEE), Sindbis virus, rubella virus, Semliki Forest virus, Ross River virus, Barmah Forest virus, O'nyong'nyong virus, and the chikungunya virus; a member of the Poxviridae family (e.g., a member of the Orthopoxvirus genus), which includes the smallpox virus, monkeypox virus, and vaccinia virus; a member of the Herpesviridae family, which includes the herpes simplex virus (HSV; types 1, 2, and 6), human herpes virus (e.g., types 7 and 8), cytomegalovirus (CMV), Epstein-Barr virus (EBV), Varicella-Zoster virus, and Kaposi's sarcoma associated-herpesvirus (KSHV); a member of the Orthomyxoviridae family, which includes the influenza virus (A, B, and C), such as the H5N1 avian influenza virus or H1N1 swine flu; a member of the Coronaviridae family, which includes the severe acute respiratory syndrome (SARS) virus; a member of the Rhabdoviridae family, which includes the rabies virus and vesicular stomatitis virus (VSV); a member of the Paramyxoviridae family, which includes the human respiratory syncytial virus (RSV), Newcastle disease virus, hendravirus, nipahvirus, measles virus, rinderpest virus, canine distemper virus, Sendai virus, human parainfluenza virus (e.g., 1, 2, 3, and 4), rhinovirus, and mumps virus; a member of the Picornaviridae family, which includes the poliovirus, human enterovirus (A, B, C, and D), hepatitis A virus, and the coxsackievirus; a member of the Hepadnaviridae family, which includes the hepatitis B virus; a member of the Papillomaviridae family, which includes the human papilloma virus; a member of the Parvoviridae family, which includes the adeno-associated virus; a member of the Astroviridae family, which includes the astrovirus; a member of the Polyomaviridae family, which includes the JC virus, BK virus, and SV40 virus; a member of the Caliciviridae family, which includes the Norwalk virus; a member of the Reoviridae family, which includes the rotavirus; and a member of the Retroviridae family, which includes the human immunodeficiency virus (HIV; e.g., types 1 and 2), and human T-lymphotropic virus Types I and II (HTLV-1 and HTLV-2, respectively).

Infectious agents with which sequences in the reference database may be associated can be fungal. Examples of fungal infectious agents include, without limitation, Aspergillus, Blastomyces, Coccidioides, Cryptococcus, Histoplasma, Paracoccidioides, Sporothrix, and at least three genera of Zygomycetes. Secondary infections that can worsen diaper rash include fungal organisms (for example yeasts of the genus Candida). The above fungi, as well as many other fungi, can cause disease in pets and companion animals. The present teaching is inclusive of substrates that contact animals directly or indirectly. Examples of organisms that cause disease in animals include Malassezia furfur, Epidermophyton floccosum, Trichophyton mentagrophytes, Trichophyton rubrum, Trichophyton tonsurans, Trichophyton equinum, Dermatophilus congolensis, Microsporum canis, Microsporum audouinii, Microsporum gypseum, Malassezia ovale, Pseudallescheria, Scopulariopsis, Scedosporium, and Candida albicans. Further examples of fungal infectious agents include, but are not limited to, Aspergillus, Blastomyces dermatitidis, Candida, Coccidioides immitis, Cryptococcus neoformans, Histoplasma capsulatum var. capsulatum, Paracoccidioides brasiliensis, Sporothrix schenckii, Zygomycetes spp., Absidia corymbifera, Rhizomucor pusillus, or Rhizopus arrhizus.

Another example of infectious agents with which sequences in a reference database may be associated are parasites. Non-limiting examples of parasites include Plasmodium, Leishmania, Babesia, Treponema, Borrelia, Trypanosoma, Toxoplasma gondii, Plasmodium falciparum, P. vivax, P. ovale, P. malariae, Trypanosoma spp., or Legionella spp.

The reference database may combine sequences associated with different infectious agents (e.g. reference sequences associated with infection by a variety of bacterial agents, a variety of viral agents, and a variety of fungal agents). Moreover, the reference database may comprise sequences identified as originating from a pathogen that has not yet been identified or classified.

Reference sequences associated with a condition also include genetic markers for drug resistance, pathogenicity, and disease. A variety of disease-associated markers are known, which may be represented in the reference database. A disease-associated marker may be a causal genetic variant. In general, causal genetic variants are genetic variants for which there is statistical, biological, and/or functional evidence of association with a disease or trait. A single causal genetic variant can be associated with more than one disease or trait. In some embodiments, a causal genetic variant can be associated with a Mendelian trait, a non-Mendelian trait, or both. Causal genetic variants can manifest as variations in a polynucleotide, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, or more sequence differences (such as between a polynucleotide comprising the causal genetic variant and a polynucleotide lacking the causal genetic variant at the same relative genomic position). Non-limiting examples of types of causal genetic variants include single nucleotide polymorphisms (SNP), deletion/insertion polymorphisms (DIP), copy number variants (CNV), short tandem repeats (STR), restriction fragment length polymorphisms (RFLP), simple sequence repeats (SSR), variable number of tandem repeats (VNTR), randomly amplified polymorphic DNA (RAPD), amplified fragment length polymorphisms (AFLP), inter-retrotransposon amplified polymorphisms (IRAP), long and short interspersed elements (LINE/SINE), long tandem repeats (LTR), mobile elements, retrotransposon microsatellite amplified polymorphisms, retrotransposon-based insertion polymorphisms, sequence specific amplified polymorphism, and heritable epigenetic modification (for example, DNA methylation). A causal genetic variant may also be a set of closely related causal genetic variants. Some causal genetic variants may exert influence as sequence variations in RNA polynucleotides.
At this level, some causal genetic variants are also indicated by the presence or absence of a species of RNA polynucleotides. Also, some causal genetic variants result in sequence variations in protein polypeptides. A number of causal genetic variants are known in the art. An example of a causal genetic variant that is a SNP is the Hb S variant of hemoglobin that causes sickle cell anemia. An example of a causal genetic variant that is a DIP is the delta508 mutation of the CFTR gene which causes cystic fibrosis. An example of a causal genetic variant that is a CNV is trisomy 21, which causes Down's syndrome. An example of a causal genetic variant that is an STR is the tandem repeat that causes Huntington's disease. Additional non-limiting examples of causal genetic variants are described in WO2014015084A2 and US20100022406. Examples of drug resistance markers include enzymes conferring resistance to various aminoglycoside antibiotics such as G418 and neomycin (e.g., an aminoglycoside 3′-phosphotransferase, 3′APH II, also known as neomycin phosphotransferase II (nptII or “neo”)), Zeocin™ or bleomycin (e.g., the protein encoded by the ble gene from Streptoalloteichus hindustanus), hygromycin (e.g., hygromycin resistance gene, hph, from Streptomyces hygroscopicus or from a plasmid isolated from Escherichia coli or Klebsiella pneumoniae, which codes for a kinase (hygromycin phosphotransferase, HPT) that inactivates Hygromycin B through phosphorylation), puromycin (e.g., the Streptomyces alboniger puromycin-N-acetyl-transferase (pac) gene), or blasticidin (e.g., an acetyl transferase encoded by the bls gene from Streptoverticillium sp. JCM 4673, or a deaminase encoded by a gene such as bsr, from Bacillus cereus or the BSD resistance gene from Aspergillus terreus). Other exemplary drug resistance markers are dihydrofolate reductase (DHFR), adenosine deaminase (ADA), thymidine kinase (TK), and hypoxanthine-guanine phosphoribosyltransferase (HPRT).
Proteins such as P-glycoprotein and other multidrug resistance proteins act as pumps through which various cytotoxic compounds, e.g., chemotherapeutic agents such as vinblastine and anthracyclines, are expelled from cells. Exemplary markers of pathogenicity include: factors involved in outer-membrane protein expression, microbial toxins, factors involved in biofilm formation, factors involved in carbohydrate transport and metabolism, factors involved in cell envelope synthesis, and factors involved in lipid metabolism. Exemplary markers of pathogenicity can include, but are not limited to, gp120, ebola virus envelope protein, or other glycosylated viral envelope proteins or viral proteins.

The reference database may consist of host expression profiles associated with a healthy state and/or one or more disease states, in which certain combinations of expressed genes (or levels of expression of particular genes) identify a condition of a subject. The groups of genes may be overlapping. The reference database consisting of sequences associated with a condition may comprise both host expression profiles and groups of sequences associated with other conditions (e.g. reference sequences associated with various infectious agents).

In cases where the reference database consists of sequences associated with a condition, the method may comprise identifying the condition in the sample or the source from which the sample is derived. The condition may be identified based on the presence or change in 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the components of a biosignature. Alternatively, a condition may be identified based on the presence or change in less than 20%, 10%, 1%, 0.10%, 0.01%, 0.001%, 0.0001%, or 0.00001% of the components of a biosignature. In some embodiments, a sample is identified as affected by the condition if at least 80% of the sequences and/or taxa associated with the condition are identified as present (or present at a level associated with the condition). In some embodiments, the sample is identified as affected by the condition if at least 90%, 95%, 99%, or all sequences or taxa (or quantities of these) associated with the condition are present. Where the condition is one of being from a particular individual, such as an individual subject (e.g. a human in a database of sequences from a plurality of different humans), identifying the sample as being affected by the condition comprises identifying the sample as being from the individual to whom the sequences in the database correspond. In some embodiments, identifying a subject as the source of the sample is based on only a fraction of the subject's genomic sequence (e.g. less than 50%, 25%, 10%, 5%, or less).
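By way of illustration only, the threshold-based identification of a condition described above may be sketched as follows. The sketch assumes that a biosignature is represented as a set of component names (sequences and/or taxa) and that a sample is represented by the set of components identified as present; the function names, component names, and default threshold of 80% are chosen for the example and are hypothetical.

```python
def fraction_present(biosignature, detected):
    """Fraction of biosignature components identified as present in a sample."""
    components = set(biosignature)
    if not components:
        raise ValueError("biosignature must contain at least one component")
    return len(components & set(detected)) / len(components)


def is_affected(biosignature, detected, threshold=0.8):
    """True if at least `threshold` of the biosignature components are present."""
    return fraction_present(biosignature, detected) >= threshold


# Hypothetical biosignature of five taxa associated with a condition.
signature = {"taxon_A", "taxon_B", "taxon_C", "taxon_D", "taxon_E"}

# Hypothetical sample in which four of the five components are detected,
# together with an unrelated taxon that is not part of the biosignature.
sample = {"taxon_A", "taxon_B", "taxon_C", "taxon_D", "taxon_X"}

affected = is_affected(signature, sample)
```

A stricter embodiment, such as requiring at least 90%, 95%, or 99% of the components, corresponds to passing a different value for the hypothetical threshold parameter.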

The presence, absence, or abundance of particular sequences, polymorphisms, or taxa can be used for diagnostic purposes, such as inferring that a sample or subject has a particular condition (e.g. an illness), has had a particular condition, or is likely to develop a particular condition if sequence reads associated with the condition (e.g. from a particular disease-causing organism) are present at higher levels than a control (e.g. an uninfected individual). In another embodiment, the sequencing reads can originate from the host and indicate the presence of a disease-causing organism by measuring the presence, absence, or abundance of a host gene in a sample. The presence, absence, or abundance can be used to determine the need for a treatment or care intensity, inform the choice of a treatment, or infer effectiveness of a treatment, wherein a decrease in the number of sequencing reads from a disease-causing agent after treatment, or a change in the presence, absence, or abundance of specific host-response genes, indicates that a treatment is effective, whereas no change or insufficient change indicates that the treatment is ineffective. The sample can be assayed before or one or more times after treatment is begun. In some examples, the treatment of the infected subject is altered based on the results of the monitoring.

In some cases, one or more samples (e.g. blood, plasma, other body fluids, tissues, swab samples etc.) having a known condition may be used to establish a biosignature for that condition. The biosignature may be established by associating the record database with the condition. The condition can be any condition described herein. For example, a plurality of samples from a particular environmental source may be used to identify sequences and/or taxa associated with that environmental source, thereby establishing a biosignature consisting of those sequences and/or taxa so associated. In general, the term “biosignature” is used to refer to an association of the presence, absence, or abundance of a plurality of sequences and/or taxa with a particular condition, such as a classification, diagnosis, prognosis, and/or predicted outcome of a condition in a subject; a sample source; contamination by one or more contaminants; or other condition. A biosignature may be used as a reference database associated with a condition for the identification of that condition in another sample. In one embodiment, establishing the biosignature comprises a determination of the presence, absence, and/or quantity of at least 10, 50, 100, 1000, 10000, 100000, 1000000, or more sequences and/or taxa in a sample using a single assay. Establishing a biosignature may comprise comparing sequencing reads for one or more samples representative of the condition with one or more samples not representative of the condition. For example, a biosignature can consist of gene expression involved in a host response (e.g. an immune response) among individuals infected by a virus, which sequences may be compared to sequences from subjects that are not infected or are infected by some other agent (e.g. bacteria).
In such a case, the presence, absence, or abundance of particular sequencing reads may be associated with a viral rather than a bacterial infection. In another example, the biosignature can consist of sequences of genes involved in a variety of antiviral responses, the presence, absence, or abundance of sequencing reads associated with which can be indicative of a specific class or type of viral infection. In some embodiments, the biosignature associated with a reference database consists of the sequences (and optionally levels) of host transcripts and/or the sequences (and optionally levels) of transcripts or genomes of one or more infectious agents. In one particular example, the condition is influenza infection and the biosignature consists of sequences of one or more of (e.g. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, or all of) IFIT1, IFI6, IFIT2, ISG15, OASL, IFIT3, NT5C3A, MX2, IFITM1, CXCL10, IFI44L, MX1, IFIH1, OAS2, SAMD9, RSAD2, and DDX58. In another example, the reference database could be common mutations or gene fusions found in cancerous cells, and the presence, absence, or abundance of sequencing reads associated with the biosignature can indicate that the patient has or does not have detectable cancer, what type of cancer a detectable cancer is, a preferred treatment method, whether existing treatment is effective, and/or prognosis.

In another example, the reference database can comprise sequences associated with contamination, such as polynucleotide and/or amino acid sequences from food contaminants, surface contaminants, or environmental contaminants. Examples of common food contaminants are Escherichia coli, Clostridium botulinum, Salmonella, Listeria, and Vibrio cholerae. Examples of surface contaminants are Escherichia coli, Clostridium botulinum, Salmonella, Listeria, Vibrio cholerae, influenza virus, methicillin-resistant Staphylococcus aureus, vancomycin-resistant Enterococci, Pseudomonas spp., Acinetobacter spp., Clostridium difficile, and norovirus. Examples of environmental contaminants are fungi such as Aspergillus and Wallemia sebi; chromalveolata such as dinoflagellates; amoebae; viruses; and bacteria. Contaminants may be infectious agents, examples of which are provided herein.

In some cases, the database of reference sequences comprises polynucleotide sequences reverse-translated from amino acid sequences. In this context, translation refers to the process of using the codon code to determine an amino acid sequence from a nucleotide sequence. The standard codon code is degenerate, such that multiple three-nucleotide codons encode the same amino acid. As such, reverse-translation often produces a variety of possible sequences that could encode a particular amino acid sequence. In some embodiments, to simplify this process, reverse-translation can use a non-degenerate code, such that each amino acid is only represented by a single codon. For example, in the standard DNA codon system, phenylalanine is encoded by “TTT” and “TTC.” A non-degenerate code would only associate one of the codons with phenylalanine. A sequencing read can be compared to this non-degenerate, reverse-translated sequence by any of the methods described herein. Furthermore, the sequencing read can be translated into all six reading-frames and reverse-translated using the same non-degenerate code to generate six polynucleotides that do not include alternate codons prior to comparing. By reverse-translating a reference amino acid sequence, and comparing it to sequencing reads translated then reverse-translated using the same reverse-translation code, nucleic acid sequences may be analyzed in the protein space.
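By way of illustration only, the non-degenerate reverse-translation described above may be sketched as follows. The sketch uses the standard genetic code and, as an arbitrary and hypothetical choice, selects for each amino acid the first codon encountered when codons are enumerated in T, C, A, G order (so that phenylalanine is always written as “TTT”). The function names are hypothetical, and codons containing ambiguous bases are simply mapped to “X” and then to “NNN”.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Non-degenerate code: keep only the first codon seen for each amino acid.
AA_TO_CODON = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODON.setdefault(aa, codon)

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons of `seq`; unknown codons become 'X'."""
    return "".join(CODON_TO_AA.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def reverse_translate(protein):
    """Write each amino acid with its single assigned codon."""
    return "".join(AA_TO_CODON.get(aa, "NNN") for aa in protein)


def six_frame_canonical(read):
    """Translate a read in all six frames, then reverse-translate each frame."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(reverse_translate(translate(strand[offset:])))
    return frames


# Hypothetical reference peptide "MF" and a read that uses the alternate codon TTC.
reference = reverse_translate("MF")
frames = six_frame_canonical("ATGTTC")
```

In this hypothetical example the read “ATGTTC” and the reference peptide both reduce to the same polynucleotide in the first forward frame, so an exact comparison succeeds even though the read uses a different phenylalanine codon.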

Comparing sequences in accordance with a method of the disclosure can provide a variety of benefits. For example, computational resources used in the performance of a method may be substantially decreased relative to a reference method, such as a method based on traditional sequence alignment. For example, the speed with which a plurality of sequences in a sample are identified may be substantially increased. In some embodiments, identifying sequencing reads as corresponding to a particular reference sequence in a database of reference sequences may be completed for 20,000 sequences in less than 1.5 seconds. In some embodiments, at least about 500000, 1000000, 2000000, 3000000, 4000000, 5000000, 10000000, or more sequences are identified per minute. The set of sequences and processor used for benchmarking sequence identification processivity may be any that are described herein. In some embodiments, the sequencing reads used for benchmarking comprise sequences from two or more of bacteria, viruses, fungi, and humans. Performance of a method described herein may be defined relative to a reference tool, such as SURPI (see e.g. Naccache, S. N. et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome research 24, 1180-1192 (2014)) or Kraken (see e.g. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology 15, R46 (2014)). In some embodiments, a method of the disclosure is at least 5-fold, 10-fold, 50-fold, 100-fold, 250-fold or more faster than SURPI in reaching results that are at least as accurate as SURPI using the same data set and computer hardware. In some embodiments, a method of the present disclosure provides improved accuracy relative to a reference analysis tool. For example, accuracy may be improved by at least 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, or more, using the same data set and computer hardware.
In some embodiments, sequences and/or taxa present in a known sample are identified with an accuracy of at least about 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or higher. In some embodiments, the methods provided herein are operable to distinguish between two or more different polynucleotides based on only a few sequence differences. For example, methods provided herein may be utilized to distinguish between two or more strains of taxa (e.g. bacterial strains) based on a low degree of sequence variation between the compared taxa. In some embodiments, one or more taxa comprise a first bacterial strain identified as present and a second bacterial strain identified as absent based on one or more nucleotide differences in sequence (e.g. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25, 50, or more differences). In some embodiments, taxa are distinguished based on fewer than 25, 10, 5, 4, 3, 2, or fewer sequence differences. In some embodiments, the first bacterial strain is identified as present and the second bacterial strain is identified as absent based on a single nucleotide difference in sequence (e.g. a SNP).
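As a non-limiting illustration of how a single nucleotide difference can separate two strains, the following Python sketch collects the k-mers that occur in only one of two strain sequences. The sequences, the function names, and the choice of k = 5 are hypothetical and are used only for this example.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def strain_specific_kmers(strain_a, strain_b, k):
    """Return (k-mers found only in strain_a, k-mers found only in strain_b)."""
    a, b = kmer_set(strain_a, k), kmer_set(strain_b, k)
    return a - b, b - a


# Two hypothetical strain sequences differing by one nucleotide (C vs. A).
strain_a = "ACGTACGGATC"
strain_b = "ACGTAAGGATC"
only_a, only_b = strain_specific_kmers(strain_a, strain_b, k=5)

# A read spanning the variant position matches only strain-A-specific k-mers.
read_kmers = kmer_set("GTACGGA", 5)
supports_a = len(read_kmers & only_a)
supports_b = len(read_kmers & only_b)
```

In this toy case every k-mer that overlaps the variant position is strain-specific, so a read covering that position supports one strain and not the other.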

Sequencing data for analysis may be provided by a user, and may have been produced by any suitable means. Sequencing data may also be generated by isolating polynucleotides from a sample and sequencing a plurality of the polynucleotides. Samples from which polynucleotides may be derived for analysis by the present methods and systems can be from any of a variety of sources. Non-limiting examples of sample sources include environmental sources, industrial sources, one or more subjects, and one or more populations of microbes. Examples of environmental sources include, but are not limited to agricultural fields, lakes, rivers, water reservoirs, air vents, walls, roofs, soil samples, plants, and swimming pools. Examples of industrial sources include, but are not limited to clean rooms, hospitals, food processing areas, food production areas, foodstuffs, medical laboratories, pharmacies, and pharmaceutical compounding centers. Polynucleotides may be isolated from chromalveolata, such as malaria parasites and dinoflagellates. Examples of subjects from which polynucleotides may be isolated include multicellular organisms, such as fish, amphibians, reptiles, birds, and mammals. Examples of mammals include primates (e.g., apes, monkeys, gorillas), rodents (e.g., mice, rats), cows, pigs, sheep, horses, dogs, cats, and rabbits. In preferred embodiments, the mammal is a human. In some cases, the sample is from an individual subject.
A sample may comprise a sample from a subject, such as whole blood; blood products; red blood cells; white blood cells; buffy coat; swabs; urine; sputum; saliva; semen; lymphatic fluid; amniotic fluid; cerebrospinal fluid; peritoneal effusions; pleural effusions; biopsy samples; fluid from cysts; synovial fluid; vitreous humor; aqueous humor; bursa fluid; eye washes; eye aspirates; plasma; serum; pulmonary lavage; lung aspirates; animal, including human, tissues, including but not limited to, liver, spleen, kidney, lung, intestine, brain, heart, muscle, pancreas, cell cultures, as well as lysates, extracts, or materials and fractions obtained from the samples described above or any cells and microorganisms and viruses that may be present on or in a sample. A sample may comprise cells of a primary culture or a cell line. Examples of cell lines include, but are not limited to 293-T human kidney cells, A2870 human ovary cells, A431 human epithelium, B35 rat neuroblastoma cells, BHK-21 hamster kidney cells, BR293 human breast cells, CHO Chinese hamster ovary cells, CORL23 human lung cells, HeLa cells, or Jurkat cells. The sample may comprise a homogeneous or mixed population of microbes, including one or more of viruses, bacteria, protists, monerans, chromalveolata, archaea, or fungi. Examples of viruses include, but are not limited to human immunodeficiency virus, ebola virus, rhinovirus, influenza, rotavirus, hepatitis virus, West Nile virus, ringspot virus, mosaic viruses, herpesviruses, and lettuce big-vein associated virus.
Non-limiting examples of bacteria include Staphylococcus aureus, Staphylococcus aureus Mu3, Staphylococcus epidermidis, Streptococcus agalactiae, Streptococcus pyogenes, Streptococcus pneumoniae, Escherichia coli, Citrobacter koseri, Clostridium perfringens, Enterococcus faecalis, Klebsiella pneumoniae, Lactobacillus acidophilus, Listeria monocytogenes, Propionibacterium granulosum, Pseudomonas aeruginosa, Serratia marcescens, Bacillus cereus, Staphylococcus aureus Mu50, Yersinia enterocolitica, Staphylococcus simulans, Micrococcus luteus, and Enterobacter aerogenes. Examples of fungi include, but are not limited to Absidia corymbifera, Aspergillus niger, Candida albicans, Geotrichum candidum, Hansenula anomala, Microsporum gypseum, Monilia, Mucor, Penicillium expansum, Rhizopus, Rhodotorula, Saccharomyces bayanus, Saccharomyces carlsbergensis, Saccharomyces uvarum, and Saccharomyces cerevisiae. A sample can also be a processed sample, such as a preserved, fixed and/or stabilised sample. A sample can comprise or consist essentially of RNA. A sample can comprise or consist essentially of DNA. In some embodiments, cell-free polynucleotides (e.g. cell-free DNA and/or cell-free RNA) are analyzed. In general, cell-free polynucleotides are extracellular polynucleotides present in a sample (e.g. a sample from which cells have been removed, a sample that is not subjected to a lysis step, or a sample that is treated to separate cellular polynucleotides from extracellular polynucleotides). For example, cell-free polynucleotides include polynucleotides released into circulation upon death of a cell, and are isolated as cell-free polynucleotides from the plasma fraction of a blood sample.

Methods for the extraction and purification of nucleic acids are well known in the art. For example, nucleic acids can be purified by organic extraction with phenol, phenol/chloroform/isoamyl alcohol, or similar formulations, including TRIzol and TriReagent. Other non-limiting examples of extraction techniques include: (1) organic extraction followed by ethanol precipitation, e.g., using a phenol/chloroform organic reagent with or without the use of an automated nucleic acid extractor, e.g., the Model 341 DNA Extractor available from Applied Biosystems (Foster City, Calif.); (2) stationary phase adsorption methods; and (3) salt-induced nucleic acid precipitation methods, such precipitation methods being typically referred to as “salting-out” methods. Another example of nucleic acid isolation and/or purification includes the use of magnetic particles to which nucleic acids can specifically or non-specifically bind, followed by isolation of the beads using a magnet, and washing and eluting the nucleic acids from the beads. In some embodiments, the above isolation methods may be preceded by an enzyme digestion step to help eliminate unwanted protein from the sample, e.g., digestion with proteinase K, or other like proteases. If desired, RNase inhibitors may be added to the lysis buffer. For certain cell or sample types, it may be desirable to add a protein denaturation/digestion step to the protocol. Purification methods may be directed to isolate DNA, RNA, or both. When both DNA and RNA are isolated together during or subsequent to an extraction procedure, further steps may be employed to purify one or both separately from the other. Sub-fractions of extracted nucleic acids can also be generated, for example, by purification based on size, sequence, or other physical or chemical characteristics.

The extracted polynucleotides from the samples can be sequenced to generate sequencing reads. Exemplary sequencing techniques can include, for example, emulsion PCR (pyrosequencing from Roche 454, semiconductor sequencing from Ion Torrent, SOLiD sequencing by ligation from Life Technologies, sequencing by synthesis from Intelligent Biosystems), bridge amplification on a flow cell (e.g. Solexa/Illumina), isothermal amplification by Wildfire technology (Life Technologies), or rolonies/nanoballs generated by rolling circle amplification (Complete Genomics, Intelligent Biosystems, Polonator). Sequencing technologies such as Heliscope (Helicos), SMRT technology (Pacific Biosciences), or nanopore sequencing (Oxford Nanopore), which allow direct sequencing of single molecules without prior clonal amplification, may also be suitable sequencing platforms. Sequencing may be performed with or without target enrichment. In some cases, polynucleotides from a sample are amplified by any suitable means prior to and/or during sequencing.

As an example, DNA sequencing technology that is used in the disclosed methods can be the Helicos True Single Molecule Sequencing (tSMS) (e.g. as described in Harris T. D. et al., Science 320:106-109 [2008]). In a typical tSMS process, a DNA sample is cleaved into strands of approximately 100 to 200 nucleotides, and a polyA sequence is added to the 3′ end of each DNA strand. Each strand is labeled by the addition of a fluorescently labeled adenosine nucleotide. The DNA strands are then hybridized to a flow cell, which contains millions of oligo-T capture sites that are immobilized to the flow cell surface. The templates can be at a density of about 100 million templates/cm². The flow cell is then loaded into an instrument, e.g., HeliScope™ sequencer, and a laser illuminates the surface of the flow cell, revealing the position of each template. A CCD camera can map the position of the templates on the flow cell surface. The template fluorescent label is then cleaved and washed away. The sequencing reaction begins by introducing a DNA polymerase and a fluorescently labeled nucleotide. The oligo-T nucleic acid serves as a primer. The polymerase incorporates the labeled nucleotides to the primer in a template directed manner. The polymerase and unincorporated nucleotides are removed. The templates that have directed incorporation of the fluorescently labeled nucleotide are discerned by imaging the flow cell surface. After imaging, a cleavage step removes the fluorescent label, and the process is repeated with other fluorescently labeled nucleotides until the desired read length is achieved. Sequence information is collected with each nucleotide addition step.

Another example process for sequencing polynucleotides is 454 sequencing (Roche) (e.g. as described in Margulies, M. et al. Nature 437:376-380 (2005)). In a first step, DNA is typically sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt-ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains a 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is discerned and analyzed.

A further example of suitable DNA sequencing technology is the SOLiD™ technology (Applied Biosystems). In SOLiD™ sequencing-by-ligation, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated.

DNA sequencing may be by single molecule, real-time (SMRT™) sequencing technology of Pacific Biosciences. In SMRT sequencing, the continuous incorporation of dye-labeled nucleotides is imaged during DNA synthesis. Single DNA polymerase molecules are attached to the bottom surface of individual zero-mode waveguides (ZMWs) that obtain sequence information while phospholinked nucleotides are being incorporated into the growing primer strand. A ZMW is a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (in microseconds). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Identification of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.

The DNA sequencing technology that is used in the disclosed methods may be nanopore sequencing (e.g. as described in Soni G V and Meller A. Clin Chem 53: 1996-2001 [2007]). Nanopore sequencing DNA analysis techniques are being industrially developed by a number of companies, including Oxford Nanopore Technologies (Oxford, United Kingdom). Nanopore sequencing is a single-molecule sequencing technology whereby a single molecule of DNA is sequenced directly as it passes through a nanopore. A nanopore is a small hole, on the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential (voltage) across it results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size and shape of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree, changing the magnitude of the current through the nanopore to a different degree. Thus, this change in the current as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.

In one embodiment, the DNA sequencing technology that is used in the disclosed methods is the chemical-sensitive field effect transistor (chemFET) array (see e.g. US20090026082). In one example of the technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be discerned by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.

Another example of a suitable DNA sequencing technology is the Ion Torrent single molecule sequencing, which pairs semiconductor technology with a simple sequencing chemistry to directly translate chemically encoded information (A, C, G, T) into digital information (0, 1) on a semiconductor chip. In nature, when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released as a byproduct. Ion Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different DNA molecule. Beneath the wells is an ion-sensitive layer and beneath that an ion sensor. When a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion will be released. The charge from that ion will change the pH of the solution, which can be identified by Ion Torrent's ion sensor. The sequencer, essentially the world's smallest solid-state pH meter, calls the base, going directly from chemical information to digital information. The Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change will be recorded and no base will be called. If there are two identical bases on the DNA strand, the voltage will be double, and the chip will record two identical bases called. Direct identification allows recordation of nucleotide incorporation in seconds.
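As a non-limiting illustration of the flow-based base calling described above, the following Python sketch models an idealized, noise-free signal in which each nucleotide flow yields a value equal to the number of identical bases incorporated (zero for no match, two for a pair of identical bases). The function names and the flow order "TACG" are assumptions made for this sketch, and the input is treated as the strand being synthesized rather than the template.

```python
def flowgram(synthesized, flow_order="TACG", max_flows=400):
    """Return (base, signal) per flow for an idealized, noise-free run.

    The signal is the number of consecutive bases incorporated in that flow.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Incorporate every consecutive copy of the flowed base.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Convert per-flow signals back into a base sequence."""
    return "".join(base * count for base, count in signals)


signals = flowgram("TTAG")
called = call_bases(signals)
```

Real instruments produce noisy, non-integer signals, so a practical base caller would need to round or model the signal rather than read it directly.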

In one aspect, the disclosure provides a method of detecting a plurality of taxa in a sample. In one embodiment, the method comprises providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (a) assigning the sequencing read to a first taxonomic group based on a first sequence comparison between the sequencing read and a first plurality of polynucleotide sequences from different first taxonomic groups, wherein at least two sequencing reads are assigned to different taxonomic groups; (b) performing with a computer system a second sequence comparison between the sequencing read and a second plurality of polynucleotide sequences corresponding to members of the first taxonomic group, wherein the comparison comprises counting a number of k-mers within the sequencing read of at least 5 nucleotides in length that exactly match one or more k-mers within a reference sequence in the second plurality of polynucleotide sequences; (c) classifying the sequencing read as belonging to a second taxonomic group that is more specific than the first taxonomic group if a measure of similarity between the sequencing read and reference sequence is above a first threshold level; (d) if no similarity above the first threshold level is identified in (c), classifying the sequencing read as belonging to the second taxonomic group based on similarity above a second threshold level determined by comparing with the computer system a sequence derived from translating the sequencing read and a third set of reference sequences corresponding to amino acid sequences of members of the first taxonomic group; and (e) identifying the presence, absence, or abundance of the plurality of taxa in the sample based on the classifying of the sequencing reads.
In some cases, a sequencing read may be identified as corresponding to a particular reference sequence, such as a gene transcript, if the measure of similarity between the sequencing read and reference sequence is above the first threshold level.

Sequence comparison may comprise any method of sequence comparison described herein. In some embodiments, sequence comparison comprises one or more comparison steps in which one or more k-mers of a sequencing read are compared to k-mers of one or more reference sequences (also referred to simply as a “reference”). In some embodiments, a k-mer is about or more than about 3 nt, 4 nt, 5 nt, 6 nt, 7 nt, 8 nt, 9 nt, 10 nt, 11 nt, 12 nt, 13 nt, 14 nt, 15 nt, 16 nt, 17 nt, 18 nt, 19 nt, 20 nt, 25 nt, 30 nt, 35 nt, 40 nt, 45 nt, 50 nt, 75 nt, 100 nt, or more in length. In some embodiments, a k-mer is about or less than about 30 nt, 25 nt, 20 nt, 15 nt, 10 nt, or fewer in length. The k-mer may be in the range of 3 nt to 13 nt, 5 nt to 25 nt, 7 nt to 99 nt, or 3 nt to 99 nt in length. The length of k-mer analyzed at each step may vary. For example, a first comparison may compare k-mers in a sequencing read and a reference sequence that are 21 nt in length, whereas a second comparison may compare k-mers in a sequencing read and a reference sequence that are 7 nt in length. For any given sequence in a comparison step, k-mers analyzed may be overlapping (such as in a sliding window), and may be of same or different lengths. While k-mers are generally referred to herein as nucleic acid sequences, sequence comparison also encompasses comparison of polypeptide sequences, including comparison of k-mers consisting of amino acids. Reference sequences and reference databases used in performing a sequence comparison can be any described herein, such as with regard to any of the various aspects of the disclosure.
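As a non-limiting illustration, overlapping k-mers may be extracted with a sliding window as in the following Python sketch. The function name is hypothetical; the same function applies to nucleotide or amino acid strings, and k may differ between comparison steps.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order (sliding window of 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


example_3mers = kmers("ACGTAC", 3)
too_short = kmers("AC", 3)  # a sequence shorter than k yields no k-mers
```

A sequence of length n yields n - k + 1 overlapping k-mers, so shorter k-mers produce more (and less specific) comparisons per read.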

In general, comparing k-mers in a read to a reference sequence comprises counting k-mer matches between the two. The stringency for identifying a match may vary. For example, a match may be an exact match, in which the nucleotide sequence of the k-mer from the read is identical to the nucleotide sequence of the k-mer from the reference. Alternatively, a match may be an incomplete match, where 1, 2, 3, 4, 5, 10, or more mismatches are permitted. In addition to counting matches, a likelihood (also referred to as a “k-mer weight” or “KW”) can be calculated. In some embodiments, the k-mer weight relates a count of a particular k-mer within a particular reference sequence, a count of the particular k-mer among a group of sequences comprising the reference sequence, and a count of the particular k-mer among all reference sequences in the database of reference sequences. In one embodiment, the k-mer weight is calculated according to the following formula, which calculates the k-mer weight as a measure of how likely it is that a particular k-mer ($K_i$) originates from a reference sequence ($ref_i$) as follows:

$KW_{ref_i}(K_i) = \frac{C_{ref}(K_i)/C_{db}(K_i)}{C_{db}(K_i)/\text{Total kmer count}} \qquad \left(\text{Eqn. 1}\right)$

C represents a function that returns the count of $K_i$. $C_{ref}(K_i)$ indicates the count of $K_i$ in a particular reference. $C_{db}(K_i)$ indicates the count of $K_i$ in the database. This weight provides a relative, database-specific measure of how likely it is that a k-mer originated from a particular reference. Prior to comparing a sequencing read to the database of reference sequences, the k-mer weight (or measurement of likelihood that a k-mer originates from a given reference sequence) can be calculated for each k-mer and reference sequence in the database. In some cases, when a reference database comprises sequences from a plurality of taxa, each reference sequence can be associated with a measure of likelihood, or k-mer weight, that a k-mer within the reference sequence originates from a taxon within a plurality of taxa. As a non-limiting example, a reference database can comprise sequences from multiple species of canines, and the k-mer weight could be calculated by relating the count of a given k-mer in all canine sequences to its count in the entire database, which includes other taxa. In some examples, the k-mer weight measuring how likely it is that a k-mer originates from a specific taxon is calculated by defining $C_{ref}(K_i)$ in the above equation as a function that returns the total count of $K_i$ in a particular taxon. Results may be stored in a record database, examples of which are described herein, such as with regard to any of the various aspects of the disclosure.
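As a non-limiting illustration, the k-mer weight of Eqn. 1 may be precomputed for every k-mer and reference sequence as in the following Python sketch. The function names and the toy sequences are hypothetical, and "Total kmer count" is assumed here to mean the total number of k-mer occurrences across all reference sequences in the database.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_weights(references, k):
    """Return {reference name: {k-mer: weight}} following Eqn. 1.

    weight = (C_ref / C_db) / (C_db / total), where total is assumed to be
    the number of k-mer occurrences in the whole database.
    """
    per_ref = {name: kmer_counts(seq, k) for name, seq in references.items()}
    db = Counter()
    for counts in per_ref.values():
        db.update(counts)
    total = sum(db.values())
    return {
        name: {km: (c / db[km]) / (db[km] / total) for km, c in counts.items()}
        for name, counts in per_ref.items()
    }


# Toy database with two hypothetical references.
weights = kmer_weights({"refA": "AAAC", "refB": "AACC"}, k=2)
```

In the toy database, "CC" occurs only in refB and is rare overall, so it receives a higher weight for refB than the widely shared "AA" receives for either reference. To obtain taxon-level weights, the same calculation could be run with all sequences of a taxon pooled under one name.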

A single detection process may comprise multiple sequence comparison steps. One or more of the steps may be performed for all sequences to be evaluated by that step in parallel. In some embodiments, a sequencing read is assigned to a first taxonomic group based on a first sequence comparison between the sequencing read and a first plurality of polynucleotide sequences from different first taxonomic groups, wherein at least two sequencing reads are assigned to different taxonomic groups. A first taxonomic group may be a broad class, assignment to which may specify which reference database or reference sequences should be used in a second comparison to identify the sequence or corresponding taxon with greater specificity. For example, assignment to a first taxonomic class can comprise assigning a sequence to any of bacteria, archaea, chromalveolata, viruses, fungi, plants, fish, amphibians, reptiles, birds, mammals, and humans. The first plurality of polynucleotides may be in the form of a reference database, which can comprise sequences from any of a variety of taxa to which a sequence may be assigned. The first comparison may be performed for all sequencing reads to be analyzed in parallel, such that assignment to a first taxonomic group comprises assignment to the group yielding the closest match among all groups to which a sequencing read is compared.
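As a non-limiting illustration, assignment of a read to the broad group yielding the closest match may be sketched in Python as follows. The function names, the toy sequences, and the use of the number of shared k-mers as the measure of closeness are assumptions made for this example.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_first_group(read, group_refs, k):
    """Return the group sharing the most k-mers with the read, or None.

    group_refs maps a broad group name to a list of reference sequences.
    """
    read_kmers = kmer_set(read, k)
    best_group, best_score = None, 0
    for group, seqs in group_refs.items():
        group_kmers = set().union(*(kmer_set(s, k) for s in seqs))
        score = len(read_kmers & group_kmers)
        if score > best_score:
            best_group, best_score = group, score
    return best_group


# Toy first-level database with hypothetical sequences.
first_level = {"bacteria": ["ACGTACGTGG"], "viruses": ["TTTTCCCCAA"]}
group = assign_first_group("CGTACG", first_level, k=4)
unassigned = assign_first_group("GGGGGG", first_level, k=4)
```

The returned group name can then select which reference sequences are used for the more specific second comparison.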

After assigning sequencing reads to a first taxonomic group, a second sequence comparison step may be performed, wherein a sequencing read and a second plurality of polynucleotide sequences corresponding to members of the first taxonomic group to which the read was assigned are compared. The second comparison will typically comprise counting a number of k-mers within the sequencing read of at least 5 nucleotides in length that exactly match one or more k-mers within a reference sequence in the second plurality of polynucleotide sequences. Examples of k-mer analyses are provided herein, such as with respect to any of the various aspects of the disclosure. The second plurality of sequences may be in the form of a second reference database. The second plurality of polynucleotide sequences may comprise or consist of the subset of sequences associated with the first taxonomic group to which the sequencing read was assigned, or only a subset of these. The second plurality of polynucleotide sequences may comprise or consist of sequences associated with the first taxonomic group that were not among the first polynucleotide sequences. The parameters for the second sequence comparison may be the same or different from the parameters used in the first sequence comparison. For example, k-mer length, k-mer weight threshold to identify a match, or stringency may be the same or different, each of which may be varied independently.
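As a non-limiting illustration, counting the k-mers in a read that exactly match k-mers in each reference of the second plurality may be sketched in Python as follows. The function names and toy sequences are hypothetical, and k = 5 is used only because it is the minimum length recited above.

```python
def count_exact_kmer_matches(read, reference, k=5):
    """Count k-mers in the read that exactly match any k-mer in the reference."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return sum(
        1 for i in range(len(read) - k + 1) if read[i:i + k] in ref_kmers
    )


def best_reference(read, second_level_refs, k=5):
    """Return (name, match count) of the reference with the most exact matches."""
    scored = {
        name: count_exact_kmer_matches(read, seq, k)
        for name, seq in second_level_refs.items()
    }
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical members of the first taxonomic group assigned to the read.
second_level = {"ref1": "TTACGTACGTT", "ref2": "ACGTAGGG"}
hit = best_reference("ACGTACG", second_level)
```

A raw match count is only one possible measure of similarity; the counts could instead be combined with precomputed k-mer weights before a threshold is applied.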

As a result of the second sequence comparison, a sequencing read may be classified as belonging to a second taxonomic group that is more specific than the first taxonomic group if a measure of similarity between the sequencing read and reference sequence is above a first threshold level. The threshold for making an identification may vary depending on the parameters of the comparison. Examples of possible thresholds are provided herein, such as with regard to any of the various aspects of the disclosure. Determining a threshold may comprise calculating a sum of k-mer weights for a given sequencing read, as described herein. The threshold value may be selected based on a variety of factors, such as average read length, the reference sequences to which the reads are compared, whether a specific sequence or source organism is to be identified as present in the sample, and the like. The threshold value can be specific to the set of specified reference sequences. If the sum of k-mer weights for the reference sequence is above the threshold level, the sequencing read may be identified as corresponding to the reference sequence, and optionally the organism or taxonomic group associated with the reference sequence. In some cases, the read is assigned to the reference sequence with the maximum sum of k-mer weights, which may or may not be required to be above a threshold. In the case of a tie, where a sequence read has an equal k-mer weight of belonging to more than one reference sequence, the sequence read can be assigned to the taxonomic lowest common ancestor (LCA) taking into account the read's total k-mer weight along each branch of the phylogenetic tree. In general, correspondence with a reference sequence, organism, or taxonomic group indicates that it was present in the sample. In general, a second taxonomic group is considered more specific than a first taxonomic group when the second taxonomic group is of a more specific hierarchical order.
For example, the first taxonomic group may be at the level of family, while the second taxonomic group is at the level of genus or species. Where the first taxonomic group is at the species level, the second taxonomic group may be at the level of a specific individual. For example, a sequence may be identified as human in the first sequence comparison, and classification based on the second comparison may identify the particular human from which the sequence was derived, a process which may further involve comparison of groups of sequences.
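As a non-limiting illustration, classifying a read by its maximum sum of k-mer weights against a threshold may be sketched in Python as follows. The function names, the weight values, and the thresholds are hypothetical; the weights are assumed to have been precomputed per reference sequence.

```python
def score_read(read, weights, k):
    """Sum the weights of the read's k-mers for each reference."""
    return {
        ref: sum(w.get(read[i:i + k], 0.0) for i in range(len(read) - k + 1))
        for ref, w in weights.items()
    }


def classify(read, weights, k, threshold):
    """Return the references tied for the maximum score, if above threshold.

    An empty list means the read could not be classified at this step;
    more than one entry means a tie that still has to be resolved.
    """
    scores = score_read(read, weights, k)
    best = max(scores.values(), default=0.0)
    if best <= threshold:
        return []
    return sorted(ref for ref, s in scores.items() if s == best)


# Hypothetical precomputed k-mer weights for two references.
toy_weights = {"refA": {"AAC": 2.0, "ACG": 1.0}, "refB": {"ACG": 1.5}}
confident = classify("AACG", toy_weights, k=3, threshold=2.0)
unclassified = classify("AACG", toy_weights, k=3, threshold=5.0)
```

Returning all tied references, rather than an arbitrary one, leaves room for a separate tie-resolution step such as assignment to a common ancestor.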

In some cases, classifying a sequencing read is not possible on the basis of the second comparison, such as in the case where the maximum sum of k-mer weights for a sequencing read is below a threshold. In this case, classifying the sequence read as belonging to the second taxonomic group can be based on similarity above a second threshold level determined by comparing with the computer system a sequence derived from translating the sequencing read and a third set of reference sequences corresponding to amino acid sequences of members of the first taxonomic group. Methods for translating sequencing reads are described herein. The process may comprise translation of one or more reading frames, such as all 6 reading frames. Comparison may be at the level of amino acids, where the translated sequencing read is compared to a set of reference amino acid sequences. Alternatively, the translated sequencing reads may be reverse-translated, and compared to reference sequences derived from reverse-translating reference amino acid sequences. Methods for translating and reverse-translating are described herein, and include reverse-translating using a non-degenerate code. Reference amino acid sequences may be in the form of a reference database, examples of which are described herein.

In some cases, classifying a sequencing read is still not possible on the basis of the comparison to the third set of reference sequences, such as in the case where the maximum sum of k-mer weights for a sequencing read is below a threshold. In this case, the method may further comprise performing with the computer system a relaxed sequence comparison between the sequencing read and the second plurality of polynucleotide sequences. In general, the relaxed sequence comparison is less stringent than the second sequence comparison. Methods for reducing stringency of a sequencing comparison are described herein, such as with regard to any of the various aspects of the disclosure. Classifying may then be possible based on identifying matching sequences at the lower stringency. A similar reduced-stringency analysis may be applied with respect to reverse-translated amino acid reference sequences, which may be performed in place of or in addition to reduced-stringency comparison of reference polynucleotide sequences.

At any given step, two or more reference sequences from different taxa may be identified as possibly corresponding to the sequencing read based on the parameters for comparison. In such cases, the tie will usually be resolved in order to assign the sequencing read to just one reference sequence or taxon. In some cases, a tie between two or more possible taxonomic groups is resolved based on the k-mer weight indicating that the sequencing read corresponds to a polynucleotide from an ancestor of one of the possible taxonomic groups. Methods for resolving such ties are described herein, such as with regard to any of the various aspects of the disclosure.

Once a sequence has been classified as belonging to a second taxonomic group that is more specific than the first taxonomic group, the presence, absence, or abundance (which may be relative abundance) of a plurality of taxa in the sample may be determined. Methods for making such a determination on the basis of identifying sequencing reads are provided herein, such as with regard to any of the various aspects of the disclosure. In some embodiments, a method may further comprise quantifying an amount of polynucleotides corresponding to a reference sequence identified in an earlier step. Quantification can be based on a number of corresponding sequencing reads identified. This can include normalizing the count by the total number of reads, the total number of reads associated with sequences, the length of the reference sequence, or a combination thereof. Examples of such normalization include FPKM and RPKM, but may also include other methods that take into account the relative amount of reads in different samples, such as normalizing sequencing reads from samples by the median of ratios of observed counts per sequence. A difference in quantity between samples can indicate a difference between the two samples. The quantitation can be used to identify differences between subjects, such as comparing the taxa present in the microbiota of subjects with different diets, or to observe changes in the same subject over time, such as observing the taxa present in the microbiota of a subject before and after going on a particular diet. If a sequencing read classified as belonging to the second taxonomic group is not present among the group of reference sequences associated with that second taxonomic group, it may be added to the group of reference sequences for use in future comparisons.
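As an illustration of the normalization mentioned above, the following Python sketch computes RPKM (reads per kilobase of reference per million reads) and a simple relative abundance. The function and parameter names are illustrative; which read total is used as the denominator (all reads or only classified reads) is a choice left to the user.

```python
def rpkm(read_count, ref_length_bp, total_reads):
    """Reads per kilobase of reference sequence per million reads."""
    return read_count * 1e9 / (ref_length_bp * total_reads)

def relative_abundance(counts):
    """Convert per-taxon read counts into fractions of the classified total."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}
```

For example, 50 reads assigned to a 2,000 bp reference out of 1,000,000 total reads gives an RPKM of 25.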

In some embodiments, a method may comprise determining the presence, absence, or abundance of specific taxa within samples based on results of an earlier step. In this case, the plurality of reference polynucleotide sequences typically comprise groups of sequences corresponding to individual taxa in the plurality of taxa. In some cases, at least 50, 100, 250, 500, 1000, 5000, 10000, 50000, 100000, 250000, 500000, or 1000000 different taxa are identified as absent or present (and optionally abundance, which may be relative) based on sequences analyzed by a method described herein. In some cases, this analysis is performed in parallel. In some embodiments, the methods, compositions, and systems of the present disclosure enable parallel detection of the presence or absence of a taxon in a community of taxa, such as an environmental or clinical sample, when the taxon identified comprises less than 0.05% of the total population of taxa in the source sample. In some cases, detection is based on sequencing reads corresponding to a polynucleotide that is present at less than 0.01% of the total nucleic acid population. The particular polynucleotide may be at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96% or 97% homologous to other nucleic acids in the population. In some cases, the particular polynucleotide is less than 75%, 50%, 40%, 30%, 20%, or 10% homologous to other nucleic acids in the population. Determining the presence, absence, or abundance of specific taxa can comprise identifying an individual subject as the source of a sample. For example, a reference database may comprise a plurality of reference sequences, each of which corresponds to an individual organism (e.g. a human subject), with sequences from a plurality of different subjects represented among the reference sequences. Sequencing reads for an unknown sample may then be compared to sequences of the reference database, and based on identifying the sequencing reads in accordance with a described method, an individual represented in the reference database may be identified as the sample source of the sequencing reads. In such a case, the reference database may comprise sequences from at least 10², 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or more individuals.

In some cases, a sequencing read does not have a match to a reference sequence at the level of a particular taxonomic group (e.g. at the species level), or at any taxonomic level. When no match is found, the corresponding sequence may be added to a reference database on the basis of known characteristics. In some cases, when a sequence is identified as belonging to a particular taxon in the plurality of taxa, and is not present among the group of sequences corresponding to that taxon, it is added to the group of sequences corresponding to the taxon for use in later sequence comparisons. For example, if a bacterial genome is identified as belonging to a particular taxon, such as a genus or family, but the genome comprises sequence that is not present in the sequences associated with that taxon, the bacterial genome can be added to the sequence database. Likewise, if the sample is derived from a particular source or condition, the sequencing read may be added to a reference database of sequences associated with that source or condition for use in identifying future samples that share the same source or condition. As a further example, a sequence that does not have a match at a lower level but does have a match at a higher level, as identified according to a method described herein, may be assigned to that higher level while also adding the sequencing read to the plurality of reference sequences that correspond to that taxonomic group. Reference databases so updated may be used in later sequence comparisons.

In some embodiments, identifying the presence, absence, or abundance of the plurality of taxa may be used to diagnose a condition based on a degree of similarity between the plurality of taxa detected in the sample and a biological signature for the condition. The condition can be any of the conditions described herein with regard to any of the aspects of the disclosure. Example conditions include, but are not limited to, contamination (e.g. environmental contamination, surface contamination, food contamination, air contamination, water contamination, cell culture contamination), stimulus response (e.g. drug responder or non-responder, allergic response, treatment response), infection (e.g. bacterial infection, fungal infection, viral infection), disease state (e.g. presence of disease, worsening of disease, disease recovery), a healthy state, or the identity of a sample source (e.g. a specific location or an individual subject). Examples of these are provided herein. The method may comprise identifying the condition in the sample or the source from which the sample is derived. The condition may be identified based on the presence or change in 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of the components of a biosignature. Alternatively, a condition may be identified based on the presence or change in less than 20%, 10%, 1%, 0.1%, 0.01%, 0.001%, 0.0001%, or 0.00001% of the components of a biosignature. In some embodiments, a sample is identified as affected by the condition if at least 80% of the sequences and/or taxa associated with the condition are identified as present (or present at a level associated with the condition). In some embodiments, the sample is identified as affected by the condition if at least 90%, 95%, 99%, or all sequences or taxa (or quantities of these) associated with the condition are present. Where the condition is one of being from a particular individual, such as an individual subject (e.g. a human in a database of sequences from a plurality of different humans), identifying the sample as being affected by the condition comprises identifying the sample as being from the individual to whom the sequences in the database correspond. In some embodiments, identifying a subject as the source of the sample is based on only a fraction of the subject's genomic sequence (e.g. less than 50%, 25%, 10%, 5%, or less).
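The fraction-of-components criterion described above can be sketched in Python as follows. The sketch considers presence only (not abundance levels), and the 80% default mirrors one of the example cutoffs in the text; the function names are illustrative.

```python
def signature_fraction(signature, detected):
    """Fraction of biosignature components found among the detected taxa."""
    signature = set(signature)
    return len(signature & set(detected)) / len(signature)

def has_condition(signature, detected, min_fraction=0.8):
    """Flag a sample when enough biosignature components are present."""
    return signature_fraction(signature, detected) >= min_fraction
```

A sample in which 4 of 5 signature taxa are detected would meet the 80% criterion, whereas one with 3 of 5 would not.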

The presence, absence, or abundance of particular sequences or taxa can be used for diagnostic purposes, such as inferring that a sample or subject has a particular condition (e.g. an illness) if sequence reads from a particular disease-causing organism are present at higher levels than a control (e.g. an uninfected individual). In another embodiment, the sequencing reads can originate from the host and indicate the presence of a disease-causing organism by measuring the presence, absence, or abundance of a host gene in a sample. The presence, absence, or abundance can be used to infer effectiveness of a treatment, wherein a decrease in the number of sequencing reads from a disease-causing agent after treatment, or a change in the presence, absence, or abundance of specific host-response genes, indicates that a treatment is effective, whereas no change or insufficient change indicates that the treatment is ineffective. The sample can be assayed before or one or more times after treatment is begun. In some examples, the treatment of the infected subject is altered based on the results of the monitoring.

In some cases, one or more samples having a known condition may be used to establish a biosignature for that condition using a method of the disclosure. The biosignature may be established by associating the presence, absence, or abundance of the plurality of taxa with the condition. The condition can be any condition described herein. For example, a plurality of samples from a particular environmental source may be used to identify sequences and/or taxa associated with that environmental source, thereby establishing a biosignature consisting of those sequences and/or taxa so associated. Various examples are provided elsewhere herein. In one particular example, a sample (e.g. from an individual or a cell culture) is identified as being infected by an infectious agent based on only a host gene expression biosignature, only on identification of one or more sequences associated with the infectious agent, or a combination of the two. In cases where both host transcripts and infectious agent sequences are used in identifying a condition, the condition so identified may be that of a passive carrier (e.g. where viral sequences are detected, but a host immune response is not).

The method may further comprise any of isolating polynucleotides from a sample, amplifying polynucleotides, and/or sequencing polynucleotides to generate sequencing reads for comparison, such as by any of the methods described herein.

In one aspect, the disclosure provides systems for performing any of the methods described herein. In some embodiments, the system is configured for identifying a plurality of polynucleotides in a sample from a sample source based on sequencing reads for the plurality of polynucleotides. For example, the system may comprise a computer processor programmed to, for each sequencing read: (a) perform a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) identify the sequencing read as corresponding to a particular reference sequence in a database of reference sequences if the sum of k-mer weights for the reference sequence is above a threshold level; and (c) assemble a record database comprising reference sequences identified in step (b), wherein the record database excludes reference sequences to which no sequencing read corresponds. As another example, the system may comprise one or more computer processors programmed to: (a) for each sequencing read, perform a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) for each sequencing read, calculate a probability that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights, thereby generating a sequence probability; (c) calculate a score for the presence or absence of one or more taxa based on the sequence probabilities corresponding to sequences representative of said one or more taxa; and (d) identify the one or more taxa as present or absent in the sample based on the corresponding scores.

The system may further comprise a reaction module in communication with the computer processor, wherein the reaction module performs polynucleotide sequencing reactions to produce the sequencing reads. Processors may be associated with one or more controllers, calculation units, and/or other units of a computer system, or implemented in firmware as desired. If implemented in software, the routines may be stored in any computer readable memory such as in RAM, ROM, flash memory, a magnetic disk, a laser disk, or other storage medium. Likewise, this software may be delivered to a computing device via any known delivery method including, for example, over a communication channel such as a telephone line, the internet, a wireless connection, etc., or via a transportable medium, such as a computer readable disk, flash drive, etc. The various steps may be implemented as various blocks, operations, tools, modules or techniques which, in turn, may be implemented in hardware, firmware, software, or any combination thereof. When implemented in hardware, some or all of the blocks, operations, techniques, etc. may be implemented in, for example, a custom integrated circuit (IC), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), etc. In some embodiments, the computer is configured to receive a customer request to perform a detection reaction on a sample. The computer may receive the customer request directly (e.g. by way of an input device such as a keyboard, mouse, or touch screen operated by the customer or a user entering a customer request) or indirectly (e.g. through a wired or wireless connection, including over the internet). Non-limiting examples of customers include the subject providing the sample, medical personnel, clinicians, laboratory personnel, insurance company personnel, or others in the health care industry.

In one aspect, the disclosure provides a computer-readable medium comprising codes that, upon execution by one or more processors, implement a method according to any of the methods disclosed herein. In some embodiments, execution of the computer readable medium implements a method of identifying a plurality of polynucleotides in a sample from a sample source based on sequencing reads for the plurality of polynucleotides. In one embodiment, the execution of the computer readable medium implements a method comprising: (a) for each of the sequencing reads, performing a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) for each of the sequencing reads, identifying the sequencing read as corresponding to a particular reference sequence in a database of reference sequences if the sum of k-mer weights for the reference sequence is above a threshold level; and (c) assembling a record database comprising reference sequences identified in step (b), wherein the record database excludes reference sequences to which no sequencing read corresponds.

In another embodiment, the execution of the computer readable medium implements a method of identifying one or more taxa in a sample from a sample source based on sequencing reads for a plurality of polynucleotides, the method comprising: (a) for each of the sequencing reads, performing a sequence comparison between the sequencing read and a plurality of reference polynucleotide sequences, wherein the comparison comprises calculating k-mer weights as a measure of how likely it is that k-mers within the sequencing read are derived from a reference sequence within the plurality of reference polynucleotide sequences; (b) for each of the sequencing reads, calculating a probability that the sequencing read corresponds to a particular reference sequence in a database of reference sequences based on the k-mer weights, thereby generating a sequence probability; (c) calculating a score for the presence or absence of one or more taxa based on the sequence probabilities corresponding to sequences representative of said one or more taxa; and (d) identifying the one or more taxa as present or absent in the sample based on the corresponding scores.

Computer readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium, or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the calculation steps, processing steps, etc. Volatile storage media include dynamic memory, such as main memory of a computer. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media can take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

EXAMPLES

The following examples are given for the purpose of illustrating various embodiments of the invention and are not meant to limit the present invention in any fashion. The present examples, along with the methods described herein are presently representative of preferred embodiments, are exemplary, and are not intended as limitations on the scope of the invention. Changes therein and other uses which are encompassed within the spirit of the invention as defined by the scope of the claims will occur to those skilled in the art.

Example 1: Sample System Architecture

An example system in accordance with an embodiment of the disclosure was constructed. An overview of the structure and user interface of this system is illustrated in FIGS. 1A-B, and is referred to in these examples as Taxonomer. For the various analyses in these examples, raw FASTQ files were the input for Taxonomer, which comprised four main modules. The ‘Binner’ module categorized (“binned”) sequencing reads into broad taxonomic groups (e.g. host and microbial) followed by comprehensive classification at the nucleotide (‘Classifier’ module) or amino acid-level (‘Protonomer’ and ‘Afterburner’ modules). In this example system, the ‘Binner’ module used exact k-mer counting for read assignment with a predefined minimum threshold. The ‘Classifier’ module applied exact k-mer matching and probabilistic taxonomic assignment for host transcript expression profiling and classification of bacteria and fungi at the nucleotide level. The ‘Protonomer’ module applied 6-frame translation for virus detection at the amino acid-level. When using discovery mode, sequencing reads that failed amino acid-level classification were subjected to the ‘Afterburner’ module, which used a reduced amino acid alphabet for increased sensitivity. Default classification databases included ENSEMBL (see Flicek, P. et al. Ensembl 2014. Nucleic acids research 42, D749-755 (2014)) (human transcripts), Greengenes (see DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and environmental microbiology 72, 5069-5072 (2006)) (bacteria), UNITE (see Koljalg, U. et al. Towards a unified paradigm for sequence-based identification of fungi. Molecular ecology 22, 5271-5277 (2013)) (fungi), and UniRef90 (see Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282-1288 (2007)) (viruses, phages). Microbial profiles can be provided in BIOM format (see McDonald, D. et al. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience 1, 7 (2012)). FASTQ formatted read subsets can be used for custom downstream analyses. To further remove barriers for clinical and academic adoption of metagenomics, a web interface was developed for Taxonomer that allows users to stream FASTQ files (local or through http access) to the analysis server and interactively visualize results in real-time (illustrated in FIG. 1B). Using the streaming web application, more than 1×10⁵ paired-end reads can be analyzed and visualized in about 5 seconds with a fast Internet connection. Features of Taxonomer are described in grey boxes. Additional features of Taxonomer are further described below.

The Binner database was created by counting unique 21 bp k-mers in different taxonomic or gene datasets. This was done using KAnalyze, version 0.9.7 (see Audano, P. & Vannberg, F. KAnalyze: a fast versatile pipelined k-mer toolkit. Bioinformatics 30, 2070-2072 (2014)), but could have alternatively used Jellyfish, version 2.3 (see Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764-770 (2011)). Each taxonomic or gene dataset represented a “bin” in which query sequences could be placed based on their k-mer content. Each database was assigned a unique bitwise flag that allowed k-mers belonging to one or more bins to be recognized and counted. The database bins and flags are shown in FIG. 19. The k-mer counts were merged into a single binary file with two columns, the k-mers and the database flag. Additional columns for other information can be accommodated. The file was sorted lexicographically to optimize for rapid k-mer queries. Reads were then assigned to the taxonomic group(s) with which the most k-mers were shared. Some reference sequence databases are subsets of, or overlap with, others (e.g. ‘Human transcripts’ and ‘Human genome’) and some sequences may be assigned varying taxIDs (e.g. phage sequences may be annotated as viruses or as bacteria, if integrated as prophages). As a result, query sequences may share an equal number of k-mers with more than one reference database. The ‘Binner’ module assigns these query sequences as outlined below in Table 1. For web display, sub-bins were displayed as part of a larger bin, the organization of which is summarized for visualization in Table 2.

TABLE 1 Bin assignment for reads with equal numbers of k-mer matches to multiple Binner databases and k-mer matches below threshold.

| Equal k-mer count of . . . | And . . . | Assignment |
|---|---|---|
| ‘Human transcripts’ | ‘Human genome’ and/or ‘Mitochondrial genomes’ | ‘Human transcripts’ |
| ‘Bacterial 16S’ | ‘Bacterial LSU’ and/or ‘Bacterial genomes’ and/or ‘Plastids LSU/SSU’ | ‘Bacterial 16S’ |
| ‘Fungal ITS’ | ‘Fungal genomes’ and/or ‘Fungal LSU/SSU’ | ‘Fungal ITS’ |
| ‘Phage’ | ‘Viruses (NCBI)’ and/or ‘Bacterial genomes’ | ‘Phage’ |
| All other ties | | ‘Ambiguous’ |
| K-mer count < threshold | | ‘Unknown’ |
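The bitwise flagging and the tie rules of Table 1 can be sketched together in Python. This is an illustration under stated assumptions, not the Binner implementation: the flag values are hypothetical (the actual flags appear in FIG. 19), the k-mer table is an in-memory dictionary rather than a sorted binary file, and `bin_read` is an illustrative name.

```python
from collections import Counter

BINS = ["Human transcripts", "Human genome", "Mitochondrial genomes",
        "Bacterial 16S", "Bacterial LSU", "Bacterial genomes",
        "Plastids LSU/SSU", "Fungal ITS", "Fungal genomes",
        "Fungal LSU/SSU", "Phage", "Viruses (NCBI)"]
# Hypothetical flags: one bit per bin, so a k-mer can belong to several bins.
FLAG = {name: 1 << i for i, name in enumerate(BINS)}

# Table 1: preferred bin -> bins it may be tied with.
TIE_RULES = {
    "Human transcripts": {"Human genome", "Mitochondrial genomes"},
    "Bacterial 16S": {"Bacterial LSU", "Bacterial genomes", "Plastids LSU/SSU"},
    "Fungal ITS": {"Fungal genomes", "Fungal LSU/SSU"},
    "Phage": {"Viruses (NCBI)", "Bacterial genomes"},
}

def bin_read(read, k, kmer_flags, cutoff):
    """Assign a read to the bin sharing the most k-mers, per Table 1."""
    counts = Counter()
    for i in range(len(read) - k + 1):
        flags = kmer_flags.get(read[i:i + k], 0)
        for name in BINS:
            if flags & FLAG[name]:
                counts[name] += 1
    best = max(counts.values(), default=0)
    if best < cutoff:
        return "Unknown"
    tied = {name for name, c in counts.items() if c == best}
    if len(tied) == 1:
        return tied.pop()
    for winner, partners in TIE_RULES.items():
        if winner in tied and tied - {winner} <= partners:
            return winner
    return "Ambiguous"
```

Because each bin occupies its own bit, the flags of overlapping databases can be combined with a bitwise OR when the per-database k-mer tables are merged.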

TABLE 2 Contents of visualized pie charts in the web portal.

| Bin | Sub-bins |
|---|---|
| Human | ‘Human genome’, ‘Human transcripts’, ‘Mitochondrial genomes’ |
| Bacterial | ‘Bacterial genomes’, ‘Bacterial SSU’, ‘Bacterial LSU’, ‘Plastids LSU/SSU’ |
| Fungal | ‘Fungal genomes’, ‘Fungal LSU/SSU’, ‘Fungal ITS’ |
| Viral | ‘Viruses (NCBI)’, ‘Phage’ |
| Other | ‘Other Eukaryotes LSU/SSU’ |
| Ambiguous | Any database combination not specified above |

High binning accuracy was achieved through minimal intersections (0.47%) of k-mer content from comprehensive human and microbial reference databases (FIG. 6A-6B). Optimal k-mer cutoffs were determined by receiver operating characteristic analysis using Youden's indexes and F1 scores (see Akobeng, A. K. Understanding diagnostic tests 3: Receiver operating characteristic curves. Acta Paediatr 96, 644-647 (2007)) and ranged from 3 to 13 (Table 3, default, n=11). A default k-mer cutoff of 11 was chosen for the Binner module based on these results.

TABLE 3 Optimal k-mer cutoffs for bin assignments based on the Youden's Index and F1 Score.

| | Youden's Index | F1 Score |
|---|---|---|
| Human | 13 | 13 |
| Bacteria | 5 | 8 |
| Fungal | 3 | 4 |
| Virus | 3 | 4 |
| Parasite | 22 | 21 |
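Cutoff selection by Youden's index or F1 score can be sketched in Python as follows. The inputs are assumed to be per-read k-mer match counts for reads known to belong to a bin (`counts_pos`) and reads known not to belong (`counts_neg`); these names and the toy data in the usage note are hypothetical and are not the data behind Table 3.

```python
def confusion(counts_pos, counts_neg, cutoff):
    """Count TP, FN, FP, TN when reads with count >= cutoff are called positive."""
    tp = sum(c >= cutoff for c in counts_pos)
    fn = len(counts_pos) - tp
    fp = sum(c >= cutoff for c in counts_neg)
    tn = len(counts_neg) - fp
    return tp, fn, fp, tn

def youden_j(tp, fn, fp, tn):
    """Youden's index: sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

def f1_score(tp, fn, fp, tn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_cutoff(counts_pos, counts_neg, cutoffs, score):
    """Return the first cutoff that maximizes the chosen score."""
    return max(cutoffs,
               key=lambda c: score(*confusion(counts_pos, counts_neg, c)))
```

For instance, with true-member counts of 10, 12, 15, 11, and 20 and non-member counts of 0, 1, 2, 3, and 9, both criteria select a cutoff of 10, the smallest value that separates the two groups.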

In order to eliminate binning based on reads containing adapter sequence, an adapter database can be provided; Binner can ignore k-mers present in the adapter database. In this example, Binner ignored k-mers present in Illumina TruSeq adapters. Furthermore, a database of spiked-in control sequences can be provided (e.g. database of External RNA Controls Consortium (ERCC) control sequences) to allow quantification of spike-in controls.

Classifier was used to identify the source of sequences after the sequences were subset by Binner. Classifier identified the source of a sequence based on exact k-mer matching. The k-mer weight for each reference sequence was calculated in accordance with Equation 1 and reads were assigned to a reference sequence based on the sum of k-mer weights. In the case of a tie, the query sequence was assigned to the taxonomic lowest common ancestor (LCA) taking into account the read's total k-mer weight along each branch of the phylogenetic tree.

Sequence reads that were not classified above a threshold by Classifier and sequences that Binner had placed in the viral category were additionally processed by Protonomer. Reads were translated in all six reading frames and then reverse-translated using a non-degenerate translation scheme. The UniRef90 protein database was reverse-translated using the same non-degenerate translation scheme. The reverse-translated sequences for each read were compared to the reverse-translated UniRef90 database with 30-bp k-mers (corresponding to 10 amino acids) in accordance with Equation 1 as described above.
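Reverse translation with a non-degenerate scheme can be sketched in Python by fixing exactly one codon per amino acid, so that identical peptides always yield identical nucleotide strings. The particular codon chosen for each amino acid below (the first in T, C, A, G enumeration order of the standard code) is an assumption for illustration; the scheme used by Protonomer is not specified here. Counting shared k-mers at every nucleotide offset is likewise a simplification.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Non-degenerate code: keep only the first codon seen for each amino acid.
CODON_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    CODON_FOR.setdefault(aa, "".join(codon))

def reverse_translate(protein):
    """Map each residue to its single assigned codon."""
    return "".join(CODON_FOR[aa] for aa in protein)

def shared_kmers(query_protein, reference_protein, k=30):
    """Count query k-mers (in reverse-translated space) found in the reference."""
    q = reverse_translate(query_protein)
    r = reverse_translate(reference_protein)
    ref_kmers = {r[i:i + k] for i in range(len(r) - k + 1)}
    return sum(q[i:i + k] in ref_kmers for i in range(len(q) - k + 1))
```

Because the mapping is one-to-one, exact nucleotide k-mer matching in the reverse-translated space is equivalent to exact matching of the underlying amino acid sequence, with a 30-bp k-mer spanning 10 residues when codon-aligned.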

To increase recovery of distantly homologous proteins, Taxonomer employed the Afterburner module, a degenerate k-mer matching engine that employs a collapsed amino-acid alphabet. Afterburner used k-means clustering on the BLOSUM62 matrix to generate a compressed amino acid alphabet (see FIG. 8). This compressed alphabet results in higher sensitivity in classification with sequences that are more diverged at the expense of a higher false positive rate when compared with Protonomer.
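The effect of a collapsed alphabet on k-mer matching can be illustrated with a short Python sketch. The residue groups below are a hypothetical, physicochemically motivated grouping chosen for illustration only; they are not the clusters derived from BLOSUM62 in FIG. 8.

```python
# Hypothetical residue groups; each residue is rewritten as the first
# letter of its group.
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
COLLAPSE = {aa: group[0] for group in GROUPS for aa in group}

def collapse(protein):
    """Rewrite a protein in the reduced alphabet ('X' for unknown residues)."""
    return "".join(COLLAPSE.get(aa, "X") for aa in protein)

def degenerate_kmer_hits(query, reference, k):
    """Count query k-mers that match the reference after collapsing both."""
    q, r = collapse(query), collapse(reference)
    ref_kmers = {r[i:i + k] for i in range(len(r) - k + 1)}
    return sum(q[i:i + k] in ref_kmers for i in range(len(q) - k + 1))
```

Under this grouping, the peptides MKVLE and LRIMD share no exact 5-mer, yet both collapse to IHIID and therefore match, which illustrates both the gain in sensitivity and the potential for additional false positives.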

Example 2: Sample Web-Service and Implementation

In this example, a web-service and implementation for Taxonomer as described in Example 1 are described. Complex metagenomic data can be processed quickly and effectively interpreted through web-based visualizations (FIG. 1B illustrates such an interface).

As reads were being streamed to the analysis server, a pie chart was presented summarizing the results of the binning procedure. When one of the bacterial, fungal, viral, or phage bins of the pie chart was selected, the results of the Classifier/Protonomer modules were displayed in a sunburst visualization.

Additional information was provided at the top of the web page about how many reads were sampled, the number of reads classified, and the detection threshold. The detection threshold informs a user about how abundant a particular organism must be in order to be detected with the number of reads sampled, thereby providing an indicator of the sensitivity of detection in the sample. In addition, a slider allowed the user to select an absolute cutoff for the minimum number of reads required in order to be displayed in the sunburst.

Example 3: Database Construction for Taxonomer

In this example, construction of databases for Taxonomer as described in Example 1 is described. The Classifier and Protonomer databases are modular, consisting only of multi-fasta files with a ‘parent tag’ on their definition lines. These tags describe each reference sequence's immediate phylogenetic parent-taxon.

Bacterial classification was based on a marker gene approach. The marker genes were the 16S rRNA gene and genes from the Greengenes database (reference set with operational taxonomic units, OTU, clustered at 99%, version 13_8, FIG. 19). This reference set contained 203,452 OTU clusters from 1,262,986 reference sequences. The taxonomic lineage for each OTU was used to create a hierarchical taxonomy map to represent OTU relationships. To support the OTU ‘species’ concept, the taxonomy was completed for ranks in the taxonomic lineage that had no value. Unique dummy species names from the highest taxonomic rank available were used to fill empty values. Versions of the Greengenes database were formatted for use within BLAST, the RDP Classifier, and Kraken.

Fungal classification was also based on a marker gene approach. The marker genes were internal transcribed spacer, ITS, rRNA sequences, and the UNITE database (see Koljalg, U. et al. Towards a unified paradigm for sequence-based identification of fungi. Molecular ecology 22, 5271-5277 (2013)) (version sh_taxonomy_qiime_ver6_dynamic_s_09.02.2014, FIG. 19). This reference set contained 45,674 taxa (species hypothesis, SH) generated from 376,803 reference sequences with a default-clustering threshold of 98.5% and expert taxonomic curation. Dummy names were created for ranks that had no value. Versions of the UNITE database were formatted for use with BLAST, the RDP Classifier, and Kraken.

The viral protein database was created by using UniRef90 (see Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282-1288 (2007)) downloaded on Jun. 16, 2014. The database was reduced to 289,486 viral sequences based on NCBI taxonomy. Phage sequences were separated, leaving a total of 200,880 references for other viruses. NCBI taxonomy was used to determine the sequence relationship.

For testing purposes, an additional bacterial classification database was constructed from RefSeq (identical to Kraken's full database; n=210,627 total references; n=5,242 bacterial references, using NCBI taxonomy), and the complete Ribosomal Database Project databases downloaded on Sep. 24, 2014 (n=2,929,433 references, using RDP taxonomy).

Databases were constructed to maximize query speed. K-mers were stored in lexicographical order, and k-mer minimizers were used to point to blocks of k-mers in the database. Once a block of k-mers was isolated, a binary search was used to complete the query. In addition to storing the LCA of a k-mer, we also stored the k-mer count and every reference (up to an adjustable cutoff) with its associated k-mer weight.
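The lookup scheme described above (minimizer selects a block, binary search finishes the query) can be illustrated with a minimal in-memory sketch. The function names `minimizer`, `build_index`, and `lookup` are hypothetical, the minimizer is assumed to be the lexicographically smallest substring of length m, and canonical (strand-independent) k-mers and the on-disk layout are not modeled.

```python
from bisect import bisect_left


def minimizer(kmer, m):
    """Assumed definition: lexicographically smallest m-length substring."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def build_index(kmer_values, m):
    """Group k-mers into blocks keyed by minimizer.

    Iterating over the k-mers in sorted order keeps every block's key
    list in lexicographical order, which the binary search relies on.
    """
    blocks = {}
    for kmer in sorted(kmer_values):
        keys, values = blocks.setdefault(minimizer(kmer, m), ([], []))
        keys.append(kmer)
        values.append(kmer_values[kmer])
    return blocks


def lookup(blocks, kmer, m):
    """Find the block via the minimizer, then binary-search inside it."""
    block = blocks.get(minimizer(kmer, m))
    if block is None:
        return None
    keys, values = block
    pos = bisect_left(keys, kmer)
    if pos < len(keys) and keys[pos] == kmer:
        return values[pos]
    return None


# Illustrative data only; the stored values stand in for LCA/count records.
index = build_index({"ACGTA": 7, "ACGTT": 9, "TTTTT": 3}, 3)
print(lookup(index, "ACGTT", 3))
```

Restricting the binary search to one minimizer block keeps each query within a small, contiguous region of the sorted k-mer list.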

The Binner database consisted of two binary files: one with extension “.bmi”, the other with extension “.btbi”. The file with extension “.bmi” contained information about k-mer minimizers and pointers to blocks of k-mers in the “.btbi” file. The “.bmi” file contained rows with the following format:

TABLE 3a: Variable types in .bmi file

Variable type    Variable meaning
uint64_t         k-mer minimizer
uint64_t         Number of k-mers in block indexed by this minimizer
uint64_t         Byte offset to beginning of k-mer block in “.btbi” file

The “.btbi” file had a header that is 176 bits. The header consisted of the following values and C variable types:

TABLE 4: Variable types in .btbi file

Variable type    Variable meaning
uint8_t          k-mer length (<32)
uint8_t          k-mer minimizer length (<16)
uint64_t         Unique k-mer count
uint64_t         Total k-mer count
int              k-mer cutoff
int              taxID cutoff
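As an illustration of how a header with the fields listed in Table 4 might be decoded, the following sketch uses Python's `struct` module. The byte order (little-endian), the absence of padding, and a 32-bit `int` are assumptions, as are the names `HEADER_FORMAT`, `BtbiHeader`, and `parse_header`. Under these assumptions the six listed fields pack to 26 bytes, which does not match the 176-bit header size stated above, so the actual on-disk layout should be verified against real files.

```python
import struct
from typing import NamedTuple

# Assumed layout: little-endian, no padding, 32-bit int.
# B = uint8_t, Q = uint64_t, i = int
HEADER_FORMAT = "<BBQQii"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 26 bytes under these assumptions


class BtbiHeader(NamedTuple):
    kmer_length: int
    minimizer_length: int
    unique_kmer_count: int
    total_kmer_count: int
    kmer_cutoff: int
    taxid_cutoff: int


def parse_header(raw: bytes) -> BtbiHeader:
    """Decode the leading header bytes into named fields."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("buffer is shorter than the assumed header size")
    return BtbiHeader(*struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE]))


# Round trip with illustrative values (not taken from a real database).
example = struct.pack(HEADER_FORMAT, 31, 15, 1000, 5000, 200, 50)
print(parse_header(example))
```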

Every k-mer block indexed by the “.bmi” file started with the following row in the “.btbi” file:

TABLE 5: Structure of starting row of k-mer block indexed in .btbi file

Variable type    Variable meaning
uint64_t*        Byte offsets to individual k-mers in this block

All other rows after first row in k-mer block had the following format:

TABLE 6: Structure of all other rows of k-mer block indexed in .btbi file

Variable type    Variable meaning
uint64_t         k-mer
uint64_t         k-mer database designation

The Classifier database consisted of 3 binary files with the following extensions: “.mi”, “.tbi”, and “.rsi”. The file with extension “.mi” contained information about k-mer minimizers and pointers to blocks of k-mers in the “.tbi” file. The “.mi” file contained rows with the following format:

TABLE 7: Structure of rows in .mi file

Variable type    Variable meaning
uint64_t         k-mer minimizer
uint64_t         Number of k-mers in block indexed by this minimizer
uint64_t         Byte offset to beginning of k-mer block in “.tbi” file

The “.tbi” file had a header that is 176 bits. The header consisted of the following values and C variable types:

TABLE 8: Structure of the header row in .tbi file

Variable type    Variable meaning
uint8_t          k-mer length (<32)
uint8_t          k-mer minimizer length (<16)
uint64_t         Unique k-mer count
uint64_t         Total k-mer count
int              k-mer cutoff
int              taxID cutoff

Every k-mer block indexed by the “.mi” file started with the following row in the “.tbi” file:

TABLE 9: Structure of non-header rows in .tbi file

Variable type    Variable meaning
uint64_t*        Byte offsets to individual k-mers in this block

All other rows in a k-mer block after the first row of the k-mer block had the following format in the “.tbi” file:

Variable type    Variable meaning
uint64_t         k-mer
uint64_t         k-mer count
uint64_t         Byte offset to taxIDs and k-mer weights in “.rsi” file

The “.rsi” file had the following format for each row:

TABLE 10: Structure of rows in .rsi file

Variable type    Variable meaning
uint64_t         taxID count
uint64_t*        taxIDs
uint64_t*        k-mer weights for each taxID

The Taxonomer database included individual k-mer weights for every taxID. This allowed Taxonomer to accumulate these weights across reads to increase both sequence query assignment sensitivity and specificity.
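The idea of accumulating per-taxID k-mer weights can be sketched as follows. This is a simplified illustration, not the scoring procedure of Example 1: plain summation over the k-mers of one read and selection of the highest total are assumptions, and `classify_read` and the `kmer_weights` mapping are hypothetical names.

```python
from collections import defaultdict


def classify_read(read, k, kmer_weights):
    """Sum the stored weights per taxID over all k-mers of a read.

    kmer_weights maps k-mer -> {taxID: weight}. Returns the taxID with
    the highest accumulated weight (assumed decision rule) together
    with all accumulated scores, or (None, {}) if no k-mer matched.
    """
    scores = defaultdict(float)
    for i in range(len(read) - k + 1):
        for taxid, weight in kmer_weights.get(read[i:i + k], {}).items():
            scores[taxid] += weight
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Illustrative weights only.
weights = {
    "ACG": {1: 0.5, 2: 0.1},
    "CGT": {1: 0.4},
    "GTT": {2: 0.3},
}
print(classify_read("ACGTT", 3, weights))
```

A taxID supported by several moderately weighted k-mers can outscore one supported by a single k-mer, which is the intuition behind accumulating weights rather than relying on individual matches.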

Example 4: Comparison of Taxonomer to SURPI

In this example, the performance of Taxonomer described in Example 1 was compared to SURPI (see e.g. Naccache, S. N. et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome research 24, 1180-1192 (2014)). Taxonomer used a non-greedy binning algorithm, as opposed to SURPI, which employs a greedy digital subtraction algorithm (see FIG. 7). The data analyzed in this example came from one of the 33 pediatric respiratory tract samples shown in FIG. 6D (RNA) and an additional nasopharyngeal sample (DNA). Of the reads classified as human by SURPI, Taxonomer's simultaneous binning strategy classified 1% as fungal, classified 11% only to lower resolution, and could not confidently differentiate 23% between closely related bins.

While high-level taxonomic assignments made by the two algorithms agreed for 73.8% of reads, Taxonomer assigned 16% of reads to an ambiguous origin (matching equally to multiple databases), whereas 96% of these were classified as human by SURPI. This was mostly due to highly conserved ribosomal and mitochondrial sequences, but similar effects were also apparent for fungal sequences, 18% of which were classified as human by SURPI.

Taxonomer's alignment-free binning approach was able to capture more phage/viral sequences (7,426) than the alignment-based method (5,798), and resulted in fewer unclassified sequencing reads (3.2% vs. 4.5%). Consistent with the lower abundance of rRNA and mtRNA sequences in DNA sequencing data, Taxonomer had many fewer ambiguous assignments in the DNA dataset than the RNA dataset (0.04%, of which 40% were classified as human and 59% as viral by SURPI; overall agreement 98.7%). In addition to decreased numbers of false negatives, Binner also provides users of the Taxonomer web-service with a high-level overview of the contents of even the largest and most complicated dataset within the first second or so of computation.

Example 5: Assessment of Analysis Time and Completeness of Classification

In this example, the performance of Taxonomer described in Example 1 was compared to Kraken and SURPI. FIG. 18 presents time and classification percentages for Taxonomer, Kraken, and SURPI. For this analysis, RNA-Seq data from three virus-positive respiratory tract samples with a range of host vs. microbial composition profiles were used (see e.g. Graf, E. H. Evaluation of Metagenomics for the Detection of Respiratory Viruses Directly from Clinical Samples. (2015)).

Kraken was the fastest tool; it required about 1.5 min/sample on average. However, possibly due to its reliance on nucleic acid-level classification only and its use of a single reference database, Kraken classified fewer reads than Taxonomer or SURPI. SURPI enabled amino acid-level searches for virus detection and discovery, but this greatly extended analysis times to between 1.5 and >12 hours per sample. Like SURPI, Taxonomer provided both nucleic acid and protein-based microbial classification, but Taxonomer also created a host-expression profile. Taxonomer achieved times similar to Kraken, requiring on average ˜5 minutes to classify 5-8×10⁶ paired-end reads using 16 CPUs. Moreover, Taxonomer classified the largest number of reads in 2 of the 3 samples and tied with SURPI for the third sample.

Taxonomer provided a fast and effective means for read and contig classification, was substantially more accurate than the fastest available tools (Kraken and SURPI), and achieved accuracies on 16S amplicon data that closely approached the current standard, RDP. This was facilitated by Taxonomer's comprehensive databases, its k-mer weight approach, and its ability to carry out nucleotide and protein-based searches and classification within a single integrated algorithmic framework. On the datasets tested, Taxonomer was hours faster than SURPI and days faster than RDP. 16S sequences (but not synthetic reads derived from other genomic targets) from the same unrepresented bacteria are almost always correctly binned by Taxonomer (but not erroneously classified; see FIG. 6), highlighting the advantages of Taxonomer's marker gene-based approach, both for discovery of novel organisms and for avoiding misclassification pitfalls.

FIG. 6B also shows receiver operating characteristic (ROC) curves for classification of human and microbial sequences by the ‘Binner’ module. A total of 1×10⁶ synthetic 100 bp reads (80% human, 10% bacterial, 5% fungal, 1% viral, and 4% from parasites; 1% error rate) were analyzed with the ‘Binner’ module and interpreted for correct bin assignment to calculate sensitivity and specificity using minimum k-mer count thresholds for read binning ranging from 1 to 40. Boxed and circled thresholds represent optimal cutoffs as determined by F1 score and Youden's index, respectively (see Table 3).

FIG. 6C shows that the sensitivity for binning of bacterial and viral reads can be low for phylogenetically distant species. Synthetic bacterial and viral reads were generated from single-cell sequencing-based draft bacterial genomes, bacterial genome scaffolds derived from metagenomic sequencing data, and recently published genome sequences. Sensitivity for correct binning (vs. assignment as ‘Unknown’) can be low for bacteria (median 2.1%, 5.4%, and 64.9%, respectively) and viruses (median 22.1%, 0% for n=56 of 199 viral genomes) not represented in the Binner database. In contrast, 16S sequences from the same unrepresented bacteria are almost always correctly binned (median 100%). This highlights the conservation of the 16S rRNA marker gene and the greater completeness of reference databases. As a result, organisms are still identified as present within the sample and can be placed within phylogenetic context.

FIG. 6D shows the relative read abundance of different taxonomic bins as determined by the ‘Binner’ module for 33 pediatric respiratory tract samples positive for at least one respiratory virus, including median and interquartile range (IQR) for each bin (6.3×10⁶±2×10⁶ reads/sample). Relative abundances varied greatly for all bins, reaching almost 4 orders of magnitude for the viral and fungal bins. A median of only 1% (IQR 0.4-2%) of reads could not be assigned a bin (unknown). A median of 9% of reads were derived from human mRNA, supporting the idea that host transcript expression profiling can be performed using total RNA-seq from nasopharyngeal samples.
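The two cutoff criteria named for FIG. 6B, F1 score and Youden's index, can be computed from the confusion counts obtained at each minimum k-mer count threshold. The sketch below uses the standard definitions (F1 = 2TP / (2TP + FP + FN); Youden's J = sensitivity + specificity − 1). The function names and the confusion counts are hypothetical and purely illustrative; they are not the data behind FIG. 6B.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def youden_j(tp, fp, tn, fn):
    """Youden's index: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0


def best_thresholds(counts):
    """counts maps threshold -> (tp, fp, tn, fn).

    Returns (threshold maximizing F1, threshold maximizing Youden's J).
    """
    by_f1 = max(counts, key=lambda t: f1_score(counts[t][0], counts[t][1], counts[t][3]))
    by_j = max(counts, key=lambda t: youden_j(*counts[t]))
    return by_f1, by_j


# Made-up confusion counts for three candidate thresholds.
example_counts = {
    1: (98, 300, 700, 2),
    5: (90, 50, 950, 10),
    20: (70, 5, 995, 30),
}
print(best_thresholds(example_counts))
```

The two criteria need not select the same threshold: F1 ignores true negatives and therefore rewards precision on the positive class, whereas Youden's index weighs sensitivity and specificity equally.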

Example 6: Bacterial and Fungal Classification

In this example, the embodiment of Taxonomer described in Example 1 was used to classify reads derived from bacterial and fungal samples. A comprehensive classification database can mitigate errors resulting from imperfect matches from query sequences to databases. The choice of default reference database can affect the specificity and sensitivity of a classifier. One solution is to use RefSeq, but the version of RefSeq (at the time of access) only contained some 5,000 sequenced bacterial taxa, whereas available 16S rRNA sequences suggest existence of at least 100,000 to 200,000 OTUs given existing sequence databases. Reads derived from taxa that are absent from the classification database can result in false negative and false positive classifications, especially at the genus and species level (FIG. 11).

FIG. 11 shows that query sequences not represented in the reference database cause false-positive and false-negative classifications, and that Taxonomer is less affected than other tools. FIG. 11A shows the read-level classification accuracy for synthetic reads simulated (20× coverage) from SILVA references (n=10,000) with identical representation in the reference database as classified by BLAST, the RDP Classifier, Kraken, and Taxonomer. While only 84.2% (BLAST), 85.2% (RDP), 64.9% (Kraken), and 83.7% (Taxonomer) of reads are classified to the species level (an effect of highly conserved regions of the 16S gene not allowing species-level assignment), false-positive rates are minimal for all classification algorithms: 0.4% (BLAST), 0.7% (RDP), 0.02% (Kraken), and 0.1% (Taxonomer). FIG. 11B shows the same analysis with SILVA references (n=10,000) for which highly similar, but non-identical, references (97% to 98.99% pairwise sequence identity based on full-length MegaBLAST) are present in the reference database. Proportions of reads with species-level classification drop to 39.1% (BLAST), 49.0% (RDP), 26.9% (Kraken), and 47.4% (Taxonomer), and 5.3% (BLAST), 5.1% (RDP), 10.2% (Kraken), and 13.7% (Taxonomer) of reads are classified to taxa that are different from the source of the synthetic reads. FIG. 11C shows that this effect is even more pronounced for synthetic reads simulated from SILVA references (n=10,000) that only share 90% to 96.99% pairwise sequence identity with the closest match in the reference database (based on full-length MegaBLAST). In this case, species-level classification was not possible by the commonly used definition, and even genus-level classification drops to 33.0% (BLAST), 40.8% (RDP), 32.1% (Kraken), and 38.8% (Taxonomer). At the species level, 22.1% (BLAST), 51.5% (RDP), 55.7% (Kraken), and 66.4% (Taxonomer) of reads are assigned to taxa other than those they were simulated from. All studies were performed with 250 bp paired-end 16S rDNA reads simulated at 20× coverage from randomly selected SILVA references with no error.

TABLE 11: Broad taxonomic classification of read 1 versus read 2 by SURPI differs for 2-9% of mate pairs. Broad taxonomic classification by SURPI (as per FIG. 2D) was determined for read 1 and read 2 of paired synthetic reads (SILVA 119) and RNA-Seq data (samples from FIG. 1B, limited to pairs passing quality filters, see methods). Broad taxonomic assignments were compared for concordance. Discordance ranged between 2-3% for synthetic 16S read pairs and from 3-9% for RNA-Seq data. Discordance was greatest for samples with higher abundance of bacterial reads (samples 2 and 3, FIG. 1B), presumably due to database incompleteness, inconsistent annotations, and because SURPI's assignment is based on the single reference sequence with the highest score.

Sample                Read length    Read pairs with discordant assignment, R1 vs. R2 (n)    Total pairs (n)    %
Synthetic 16S         2 × 100 bp     6,984                                                   300,128            2.3
Synthetic 16S         2 × 250 bp     2,888                                                   119,009            2.4
Sample 1 (FIG. 18)    2 × 100 bp     172,759                                                 5,916,921          2.9
Sample 2 (FIG. 18)    2 × 100 bp     586,486                                                 6,261,301          9.4
Sample 3 (FIG. 18)    2 × 100 bp     326,263                                                 5,536,276          5.9

Performance of classification tools is frequently only tested with synthetic reads derived from the reference database, such that perfect matches exist for all synthetic reads. For microbial classification, this is a highly artificial challenge, as novel species or strains are routinely encountered in clinical or environmental samples. To provide a more realistic challenge, synthetic reads were generated from bacterial 16S rRNA sequences in the SILVA database lacking perfect matches in Taxonomer's Greengenes-derived reference database (468 of 1,013 source references, 46%, had no perfect match in the classification database, Table 12). Taxonomer employed a marker gene approach and a custom Greengenes-derived database for prokaryotic classification. Classification of the synthetic reads by Taxonomer, SURPI, and Kraken was compared using each tool's default settings and databases: nt (SURPI), RefSeq (Kraken), and Greengenes 99% OTU (Taxonomer). Kraken reports the taxon identifier for each read's final taxonomic assignment. An accessory script (Kraken-filter) can be used to apply confidence scores, although it was found that this value had little impact on the results of the benchmarks (see FIG. 10). SURPI reports the best hit for its mapping tools (SNAP, RAPSearch2), which were used for comparison. The results of this comparison (FIG. 2A) show that at the species level, for example, Taxonomer correctly classified 59.5%, incorrectly classified 15.7%, and failed to classify 24.8% of the reads. By comparison, Kraken classified 29% of the reads to the correct species, and exhibited a high false-positive rate, classifying every remaining read (71%) incorrectly. The results for SURPI have been split into two columns to reflect the fact that SURPI, unlike Taxonomer and Kraken, classifies each read from mate-pair reads independently, and in many cases these assignments are discordant (Table 11). Thus, the right-hand portion of the SURPI column records the classification rates when either read from a mate pair was classified correctly; the left-hand portion records the rates for classifying both mates to the same taxon. As can be seen, SURPI underperforms both Taxonomer and Kraken.

TABLE 12: Sequence identity of full-length SILVA references used to generate synthetic read sets for FIGS. 2A-D and FIGS. 8-10 compared to the most similar reference sequence in the ‘Classifier’ database. Synthetic read sets were constructed using 1,013 randomly selected bacterial 16S sequences from the SILVA (release 119) database. The same full-length SILVA references were compared to the ‘Classifier’ reference database (Greengenes, 99% OTU clustering) using BLAST to determine sequence identity. Almost half of the SILVA reference sequences used have only imperfect matches in the ‘Classifier’ reference database. Only references with ≥97% sequence identity were used to construct synthetic read sets for FIGS. 2, 10, 12, and 13.

% ID (SILVA vs. Greengenes)    n        %
100                            545      53.8
99.5-99.99                     261      25.8
99-99.49                       117      11.5
98.5-98.99                     31       3.1
98-98.49                       26       2.6
97.5-97.99                     22       2.2
97-97.49                       11       1.1
Total                          1,013    100

In order to show the effect of different databases on Taxonomer, the synthetic reads produced above were classified using RefSeq (the Kraken default), RDP, or Greengenes (the Taxonomer default) databases (FIG. 2B). Using its default database, Taxonomer correctly classified 59.5% of the reads and recovered 94.9% of species. Using Kraken's default database (RefSeq DB), Taxonomer correctly classified 27% of the reads and recovered 71.6% of the species, similar to Kraken's results when using the same database: 29% and 71%, respectively. Also presented in FIG. 2B are Taxonomer's classification and recovery rates using the RDP database (RDP DB). For classification by RDP, classifications were resolved to the rank with a minimum confidence level of ≥0.5. Although Taxonomer misclassified very few reads using the RDP database, overall performance was substantially better using Taxonomer's default database.

The four classification tools, MegaBLAST, the RDP Classifier, Kraken, and Taxonomer, were compared using Taxonomer's default 16S database. For this example, default MegaBLAST parameters were used. Top scoring references were identified and used to assign operational taxonomic units (OTUs) or species hypotheses (SHs). Multiple OTUs/SHs were assigned to reverse-translated reads when more than one OTU/SH reference shared 100% identity. If no OTU/SH had 100% identity to a read, then all OTUs within 0.5% of the top hit were assigned to the read. The taxonomy of the assigned OTUs/SHs was compared and the highest rank in common was used to assign a taxonomic value to the read. The percent identity was used to determine the assignment of the highest taxonomic rank. Sequence reads with >97% identity to a reference were assigned to a species, >90% identity to a genus, and <90% to a family when lineage information was available at this rank. For classification by RDP, classifications were resolved as above to the rank with a minimum confidence level of ≥0.5. SURPI was not included in the comparison because there is no option to employ a user-provided database. As shown in FIG. 2C, Taxonomer's performance in classifying the simulated bacterial reads closely approximated that of the RDP Classifier, an established reference tool. At the species level, Taxonomer and RDP classified 59.5% and 61.4% of reads correctly, respectively, and recovery rates were very similar. Although Kraken's classification and recovery rates improved dramatically using Taxonomer's database compared to its own, Taxonomer still correctly classified 13.5 percentage points more reads than Kraken (59.5% vs. 46%) and also had a lower false-positive rate (15.7% vs. 20.1%). Taxonomer also outperformed Kraken on taxon recovery rate (94.9% vs. 83%), and Taxonomer's false-recovery rate was lower as well (23.3% vs. 37.9%).

We examined the impact of read length (FIG. 12) and sequencing errors upon classification accuracy (see FIG. 13). FIG. 12 shows the read-level (top) and taxon-level (bottom) bacterial classification accuracy of BLAST, the RDP Classifier, Kraken, and Taxonomer run using the Greengenes 99% OTU database. The reads were either 100 bp single-end (FIG. 12A) or 100 bp paired-end (FIG. 12B) 16S rDNA reads simulated at 5× coverage from 1,013 randomly selected SILVA references with ≥97% sequence identity to reference sequences. The performance of Taxonomer was comparable to the RDP Classifier and superior to Kraken, whereas given the applied criteria BLAST was less sensitive but more specific. FIG. 13 shows family, genus, and species level classification accuracy for BLAST, the RDP Classifier, Kraken, and Taxonomer using the same read length and database across error rates of 0.01%, 0.1%, 1%, 5%, and 10%. Performance improved for all tools as a function of read length. Taxonomer and Kraken were both more sensitive to sequencing errors than BLAST and the RDP Classifier due to their reliance on exact k-mer matches. Nevertheless, these same analyses demonstrate that Taxonomer's nucleotide classification algorithm is error resistant, with Taxonomer achieving greater classification accuracies than Kraken for sequences with less than 5% errors. FIG. 2D shows classification and recovery rates using Taxonomer's fungal database. As can be seen, the same general trends are seen in both FIG. 2C and FIG. 2D, demonstrating that Taxonomer's performance advantages are not restricted to bacterial classification.
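The MegaBLAST-based assignment rule described above (shared lineage of the tied top-scoring OTUs/SHs, capped at a rank determined by percent identity) can be sketched as follows. The seven-rank list, the function names, and the treatment of exactly 90% identity as family level are assumptions made for illustration; the lineages in the example are illustrative as well.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def rank_for_identity(pct_identity):
    """Lowest rank allowed by identity: >97% species, >90% genus, else family.

    Exactly 90% is treated as family here, which is an assumption.
    """
    if pct_identity > 97.0:
        return "species"
    if pct_identity > 90.0:
        return "genus"
    return "family"


def common_lineage(lineages):
    """Longest shared prefix of lineages ordered from kingdom to species."""
    shared = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return shared


def assign(lineages, pct_identity):
    """Shared lineage of the tied hits, truncated at the identity-based rank."""
    max_depth = RANKS.index(rank_for_identity(pct_identity)) + 1
    return common_lineage(lineages)[:max_depth]


base = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
        "Streptococcaceae", "Streptococcus"]
hit_a = base + ["Streptococcus mitis"]
hit_b = base + ["Streptococcus oralis"]
print(assign([hit_a, hit_b], 100.0))  # tied species disagree, so genus level
print(assign([hit_a], 95.0))          # identity caps the result at genus
```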

TABLE 13: Accession numbers for published 16S amplicon data used for bacterial abundance estimates (FIG. 2E), numbers of reads, and analysis times for the RDP Classifier and Taxonomer. Number of reads for reference (b) is based on mate pairs.

Sample       Source                Ref.    Reads      RDP Classifier [min]    Taxonomer [min]
ERR498444    Human gut             a       20,469     285                     0.91
ERR498459    Human gut             a       8,413      303                     0.45
ERR498467    Human gut             a       16,864     426                     0.62
ERR498476    Human gut             a       19,066     402                     0.96
ERR498532    Human gut             a       20,458     315                     1.02
ERR498541    Human gut             a       18,803     354                     0.78
ERR498566    Human gut             a       12,612     200                     0.60
ERR498576    Human gut             a       10,070     258                     0.49
ERR498611    Human gut             a       19,506     342                     0.96
ERR498653    Human gut             a       14,311     225                     0.61
ERR502969    Dog nose              b       62,836     492                     1.26
ERR502989    Human nose            b       79,093     930                     2.37
ERR503004    Kitchen floor         b       74,061     594                     2.00
ERR503007    Human hand            b       67,144     615                     1.41
ERR503052    Human nose            b       54,569     468                     0.87
ERR503054    Human hand            b       77,382     822                     2.67
ERR503166    Bedroom floor         b       57,718     498                     1.10
ERR503209    Kitchen floor         b       64,996     534                     1.91
ERR503211    Bathroom door knob    b       70,964     630                     1.44
ERR503212    Human nose            b       124,363    852                     3.27

Here Ref. (a) and Ref. (b) refer to the publications from which the data were derived: Ref. (a) was Subramanian, S. et al. Nature 510, 417-421 (2014); Ref. (b) was Lax, S. et al. Science 345, 1048-1052 (2014).

Since quantifying microbial community composition is a frequent goal of metagenomics studies, we also compared Taxonomer's bacterial abundance estimates to those of the RDP Classifier using recently published 16S amplicon sequencing data (see Table 13) and RNA-Seq-based metagenomics (FIG. 2E). Taxonomer's abundance estimates were highly correlated with RDP's across taxonomic levels for all three datasets. Taxonomer provided highly comparable community profiles at >200-fold increased speed (Spearman correlation coefficient: r²=0.955 for 2×100 bp reads; average of 1,630,923 2×100 bp reads/sample; average run times were 27.4 minutes (Taxonomer) versus 120.7 hours (RDP Classifier) on 1 CPU). FIGS. 14A and 14B show RNA-Seq metagenomics results comparing Taxonomer and Kraken using the Greengenes 99% OTU reference database. Correlation of Kraken with abundance estimates based on the RDP Classifier was weaker (Spearman correlation coefficient: r²=0.891 for 2×100 bp reads); average run times were 42 seconds/sample. FIG. 14C and FIG. 14D show 16S rRNA gene amplicon sequences of variable region 4 from two published data sets generated on HiSeq2000 (dark green, 1×150 bp reads) and MiSeq instruments (light green, 2×150 bp reads). The correlation of abundance estimates (limited to taxa with relative abundance >0.10% per sample) is shown for Taxonomer and the RDP Classifier (FIG. 14C, Spearman correlation coefficients: r²=0.858 for 1×150 bp reads, and r²=0.826 for 2×150 bp reads). The average number of reads per sample was 44,685 and the average processing times (using 1 CPU) were 1:28 minutes for Taxonomer and 7.9 hours for the RDP Classifier. The correlation of abundance estimates as determined by Kraken using the Greengenes 99% OTU reference database with abundance estimates based on the RDP Classifier was weaker (Spearman correlation coefficient: r²=0.697 for 1×150 bp reads, and r²=0.810 for 2×150 bp reads); average run times were 2.5 seconds/sample. Spearman correlation coefficients (ρ) were 0.96 and 0.997 (order) and 0.858 and 0.826 (genus) for 16S amplicon data as well as 0.992 (order) and 0.955 (genus) for RNA-Seq (FIGS. 2E and 14). However, Taxonomer's average analysis times were 260 to 440-fold faster (FIG. 2E and FIG. 15). Collectively, these benchmarks illustrate the important role of Taxonomer's classification databases and the power and speed of its classification algorithm.
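For reference, a Spearman correlation between two sets of abundance estimates can be computed as the Pearson correlation of their ranks. The sketch below is a dependency-free illustration with hypothetical function names and made-up abundance values; it assigns tied values their average rank and does not handle the degenerate case where all values in one list are equal.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if a list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up relative abundances (%) of five taxa from two classifiers.
tool_a = [42.0, 30.5, 15.0, 8.0, 4.5]
tool_b = [40.1, 33.0, 7.5, 14.0, 5.4]
print(round(spearman(tool_a, tool_b), 3))
```

Because only ranks enter the calculation, the coefficient reflects agreement in the ordering of taxa by abundance rather than agreement in absolute values.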

Example 7: Viral Classification

In this example, the embodiment of Taxonomer described in Example 1 was used to classify reads derived from viral sources. RNA-Seq data from 24 samples known to harbor particular respiratory viruses were used. The mean pairwise, genome-level sequence identity of the 24 respiratory viruses to reference sequences in the NCBI nt database was 93.7% (range: 75.9-99.8%; see Table 14 and FIG. 8A). The sequencing reads from each sample were binned by Binner, and the ‘viral’ and ‘unclassified’ bins (see FIGS. 1A and 6C) were taxonomically classified by Protonomer, RAPSearch2 (default and fast settings), and DIAMOND (default and sensitive settings). RAPSearch2 is employed by SURPI (see Zhao, Y., Tang, H. & Ye, Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 28, 125-126 (2012)), whereas DIAMOND is an ultrafast, BLAST-like protein search tool (see Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature methods (2014)). Protonomer showed a sensitivity of 94.6±2.7%, and RAPSearch2 had a sensitivity of 95.0±2.2% in default mode and 94.8±2.2% in fast mode; both were more sensitive than DIAMOND, which had a sensitivity of 90.5±2.7% in default mode and 90.5±2.7% in sensitive mode. Conversely, Protonomer (90.7±17.1%) and DIAMOND (default: 92.0±17.1%, sensitive: 91.9±14.9%) provided significantly higher specificity than RAPSearch2 in default mode (88.0±20.0%). Protonomer classified reads faster than RAPSearch2 (24-fold compared to default mode, 11-fold compared to fast mode) and DIAMOND (2.6-fold compared to default mode, 3.3-fold compared to sensitive mode) on a computer with 16 central processing units. Protonomer demonstrated the best overall performance, being more sensitive (median 94.6%) than DIAMOND (90.5%) and more specific (90.7%) than RAPSearch2 (88.0%) (FIG. 3A-C). True viral reads were determined by mapping all reads to a manually constructed viral consensus genome sequence for each sample. As expected, sensitivity for all tools correlated with pairwise identities of the viral genome to reference sequences, with DIAMOND being most vulnerable to novel sequence polymorphisms (FIG. 8B). Of note, DIAMOND does not support joint analysis of paired sequencing reads. In this comparison, the results of the read mate with the lowest E-value were used rather than reconciling results of read mates, which likely resulted in optimistic performance estimates for DIAMOND. Protonomer is also the fastest of the three tools, classifying 10⁴ to 10⁶ reads/sample (Protonomer: 14 seconds; DIAMOND: 37 seconds in default and 46 seconds in sensitive mode; RAPSearch2: 343 seconds in default and 169 seconds in fast mode, FIG. 8).

Taxonomer was further used to analyze published RNA-Seq data from three patients in whom viral pathogens with significance to public health were detected: a serum sample from a patient with hemorrhagic fever caused by a novel rhabdovirus (Bas Congo Virus, FIG. 3D); a throat swab from a patient with avian influenza (H7N9 subtype, FIG. 3E); and a plasma sample from a patient with Ebola virus (FIG. 3F). Taxonomer detected the relevant viruses in all three cases, even after removal of the target sequence from the reference database, thus demonstrating the utility of Taxonomer for rapid virus detection and discovery in public health emergencies. Its web-based deployment enables the rapid sharing and review of analysis results, even across great geographic distances.

TABLE 14: Viruses, percent nucleotide-level identity to reference sequences in the NCBI nt database, as well as numbers of total and viral reads for pediatric upper respiratory tract specimens used to compare ‘Protonomer’, RAPSearch2, and DIAMOND for protein-level classification of viral sequences (see FIGS. 3, 8 & 9). HCoV: human coronavirus; HBoV: human bocavirus; HMPV: human metapneumovirus; HRV: rhinovirus; PIV: parainfluenza virus; RSV: respiratory syncytial virus.

Virus                 Nucleotide ID    GenBank Accession    Total Reads    Target Reads (n)    Target Reads (%)
HCoV (HKU1)           99.8%            KF686344             317,354        305,544             96.3%
HCoV (NL43)           99.8%            JQ765567             44,825         20,800              46.4%
HCoV (OC43)           99.7%            AY903460             15,515         6,919               44.6%
Coxsackie Virus B4    84.1%            KF878966             21,399         1,027               4.8%
HBoV                  99.6%            JQ923422             206,869        1,119               0.5%
HMPV                  98.5%            GQ153651             80,362         7,059               8.8%
HMPV                  99.0%            EF535506             55,240         2,683               4.9%
HRV-A                 90.9%            EF173415             11,369         2,413               21.2%
HRV-C                 85.2%            DQ875932.2           490,829        491                 0.10%
HRV-C                 85.3%            DQ875932.2           704,819        394                 0.06%
HRV-C                 79.3%            JF436925.1           662,784        200                 0.03%
HRV-C                 97.3%            JX074056             385,808        208,446             54.0%
HRV-C                 82.1%            JF317017             306,436        232,451             75.9%
HRV-C                 97.2%            JX074056             246,973        35,474              14.4%
HRV-C                 75.9%            KF958311             28,862         2,657               9.2%
HRV-C                 96.0%            JN990702             330,157        252,416             76.5%
HRV-C                 76.5%            GQ223228             179,888        153,429             85.3%
HRV-C                 95.4%            GQ323774             58,005         1,369               2.4%
PIV-1                 99.2%            JQ901989             107,818        9,392               8.7%
PIV-3                 99.4%            KF530232             48,547         15,651              32.2%
RSV-A                 99.7%            KF826849.1           762,085        2,218               0.29%
RSV-B                 97.9%            JQ582843             1,784          1,035               58.0%
RSV-B                 97.9%            JQ582843             40,707         32,047              78.7%
RSV-B                 99.7%            JN032120.1           516,693        495,469             95.9%

Example 8: Human mRNA Transcript Profiling

In this example, the embodiment of Taxonomer described in Example 1 was used to profile a host response, which is of growing interest for infectious diseases testing and for quality control of cell lines and tissues where microbial contaminants may confound transcript expression profiles. Taxonomer is the only ultrafast metagenomics tool with this capability. Taxonomer's default databases included ERCC control sequences, allowing users to normalize transcript counts. By default, the reference transcripts and corresponding gene models (GTF file) were taken from the Ensembl human reference sequence, GRCh37.75. A k-mer size of 20 was used, which works well for mapping reads to human transcripts.

TABLE 15: Accession numbers for human brain RNA-Seq data used to compare with MAQC qPCR data

Sample       Source         Reads
SRR037452    Human brain    11,712,885
SRR037453    Human brain    11,413,794
SRR037454    Human brain    11,816,021
SRR037455    Human brain    11,244,980
SRR037456    Human brain    12,081,324
SRR037457    Human brain    11,365,146
SRR037458    Human brain    11,616,331

Taxonomer's expression profiles were compared to those of the standard transcript expression profiling tools Sailfish and Cufflinks, as well as quantitative PCR. Gene-level Pearson and Spearman correlation coefficients for RNA-seq versus qPCR were 0.85 and 0.84 for Taxonomer, 0.87 and 0.86 for Sailfish, and 0.80 and 0.80 for Cufflinks, respectively. These results showed that Taxonomer's quantification of synthetic reads and a commercially available RNA standard (MAQC; specifically, human brain tissue samples, see Table 15) was accurate over a broad range of transcript abundance. Indeed, accuracy was intermediate between that of Sailfish and that of Cufflinks (FIG. 4A). To demonstrate the utility of Taxonomer's capacity for simultaneous pathogen detection and transcript expression profiling, Taxonomer was used to analyze RNA-Seq data from respiratory samples of patients with influenza A virus infection (n=4) with varying abundance of host versus microbial RNA (FIG. 4B), and the mRNA expression profiles were compared to those of asymptomatic controls (n=40). Influenza A virus was detected in all samples by Taxonomer (FIG. 4C). Normalized gene-level expression of the 50 most differentially expressed host genes is shown in FIG. 4D. Expression of 17 host genes was significantly higher in influenza-positive patients (Table 16, examples in FIG. 4F), and their expression profiles clearly differentiated cases from controls in a principal component analysis with PC1 accounting for 84.7% of the total variance (FIG. 4E).

Gene ontology assignments for the top 50 differentially expressed genes were also analyzed for enrichment of biological processes (FIG. 4G) and molecular functions (FIG. 4H), demonstrating their involvement in recognition of pathogen-associated molecular patterns and antiviral host response. Most but not all of these genes are known to be differentially regulated in response to influenza virus or other viral infections in vitro or in peripheral blood of patients. Together, these results demonstrate the accuracy and power for discovery and a potential future diagnostic application of Taxonomer's combined pathogen detection and host response profiling.

Taxonomer's performance was compared to Sailfish and Cufflinks using synthetic RNA-seq reads (2×76 bp, n=15,000,000) generated with the Flux Simulator tool (see Griebel, T. et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic acids research 40, 10073-10083 (2012)); see Table 17 for parameters. TopHat (see Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105-1111 (2009)) was used to produce alignments for Cufflinks. Like Taxonomer, Sailfish does not need external alignment information.

TABLE 16 Genes (n = 17) that are differentially regulated in nasopharyngeal and oropharyngeal swabs from children with pneumonia who tested positive for influenza virus (n = 4) compared to asymptomatic controls (n = 40). Read counts and p-values (raw and adjusted) are shown. A: controls; B: influenza

Gene ID   Base Mean A   Base Mean B   Fold Change   p         p(adj)
IFIT1     0.7           73.4          104.5         7.1E−19   1.5E−14
IFI6      0.5           31.4          64.8          6.3E−13   6.7E−09
IFIT2     2.1           135.5         63.8          7.8E−09   5.5E−05
ISG15     1.4           61.2          43.3          1.4E−08   6.4E−05
OASL      0.6           20.3          33.3          1.5E−08   6.4E−05
IFIT3     2.1           81.2          38.7          5.4E−08   1.9E−04
NT5C3A    0.7           20.1          30.7          3.3E−07   9.9E−04
MX2       1.4           27.4          19.2          4.0E−07   1.1E−03
IFITM1    2.4           32.8          14.0          6.4E−07   1.5E−03
CXCL10    0.6           37.3          64.6          9.0E−07   1.9E−03
IFI44L    1.5           26.6          17.8          1.6E−06   3.1E−03
MX1       4.2           56.5          13.5          1.8E−06   3.2E−03
IFIH1     1.4           21.3          15.0          9.7E−06   1.6E−02
OAS2      2.8           37.5          13.2          1.3E−05   1.9E−02
SAMD9     2.8           61.9          22.5          2.6E−05   3.7E−02
RSAD2     1.4           47.0          33.7          2.9E−05   3.8E−02
DDX58     1.1           16.6          15.3          3.9E−05   4.8E−02

TABLE 17 Flux Simulator parameters used to generate simulated RNA-seq reads for benchmarking transcript assignment. Following the benchmarks used for Sailfish, we filtered the transcript GTF using the gffread utility with the flags -C -M -E and -T, and removed any transcripts consisting solely of Ns. The GTF was sorted using the Flux Simulator sortGTF command and used to generate the synthetic data for benchmarking.

Stage                       Parameter         Value
Expression                  NB_MOLECULES      5000000
                            REF_FILE_NAME     Homo_sapiens_ENSMBL_37.75.gtf
                            TSS_MEAN          50
                            POLYA_SCALE       NaN
                            POLYA_SHAPE       NaN
Fragmentation               FRAG_SUBSTRATE    RNA
                            FRAG_METHOD       UR
                            FRAG_UR_ETA       NaN
                            FRAG_UR_D0        1
Reverse Transcription       RTRANSCRIPTION    YES
                            RT_PRIMER         RH
                            RT_LOSSLESS       YES
                            RT_MIN            500
                            RT_MAX            5500
Filtering & Amplification   FILTERING         YES
                            GC_MEAN           NaN
                            PCR_PROBABILITY   0.05
Sequencing                  READ_NUMBER       150000000
                            READ_LENGTH       76
                            PAIRED_END        YES
                            ERR_FILE          76
                            FASTA             YES
                            UNIQUE_IDS        NO

Example 9: Identification of Infection and Contamination

In this example, the embodiment of Taxonomer described in Example 1 was used to identify infection and contamination in a biological sample. Taxonomer was used to analyze RNA-Seq data from the plasma of patients suspected of being infected with Ebola virus, but who had tested negative for Ebola virus (FIG. 5A). Taxonomer detected HIV, Lassa virus, Enterovirus (typed by Taxonomer as Coxsackievirus), and GB virus C. FIG. 16B shows that Taxonomer classified a reported Enterovirus as Enterovirus A in plasma from a patient with suspected Ebola virus disease in Sierra Leone (SRR1564825). Mean sequencing depth was 162×, covering 96% of the reference sequence (AY421765). Analysis of a manually constructed viral consensus genome sequence identified the strain as sharing 80% nucleotide sequence identity with Coxsackie virus A7, strain Parker. Taxonomer also detected the previously unrecognized bacterial infections Chlamydophila psittaci and Elizabethkingia meningoseptica, which may have caused the patients' symptoms (FIGS. 5A and 16). FIG. 16A shows that Taxonomer detected Elizabethkingia meningoseptica in sample SAMN03015718 (SRR1564828). The mean coverage of the 16S rRNA gene was 16,162-fold and the consensus sequence shared 99.9% nucleotide sequence identity with the type strain of E. meningoseptica (AJ704540, ATCC 13253). Coverage of the bacterial 16S rRNA gene was >1,000× for both cases and pairwise sequence identities to type strain sequences were >99%, enabling reliable identification. The 16S rRNA gene of C. psittaci was covered a mean of 7,035-fold with the consensus 16S rRNA sequence from this isolate sharing 99.9% identity with the type strain (6BC, ATCC VR-125, CPU68447). Positions of 2 single-nucleotide polymorphisms are highlighted in red in FIG. 5A. C. psittaci is the causative agent of psittacosis, an uncommon zoonotic infection acquired from birds that generally presents with fever, headache, cough and sometimes diarrhea. E. meningoseptica is a ubiquitous gram-negative bacterium that characteristically causes meningitis or sepsis in newborns, but can also infect immunocompromised adults.

Taxonomer was employed to detect viral infection from a respiratory sample of a child with pneumonia. Reads classified as “viral” or “unknown” were assembled using Trinity (see e.g. Grabherr M G, et al. Nat Biotechnol, 2011 May 15; 29(7):644-52) into 2,325 contigs (run time 6 seconds). Four of the contigs were identified as unclassified members of the family Anelloviridae (FIG. 5B). The consensus genome sequences had 68.5% pairwise identity with TTV-like mini virus isolate LIL-y1 (EF538880.1) and the predicted protein sequences were 44%-60% identical with those of strain LIL-y1. The pie chart and sunburst in FIG. 5B show contig-level classification. Mapping reads back to a manually-constructed viral consensus genome sequence showed 14-fold mean coverage. A phylogenetic tree was constructed using the consensus sequence of the novel Anellovirus (FIG. 5B) with reference sequences for Torque teno mini viruses. Torque teno virus 1 is shown as the outgroup (FIG. 17). This analysis demonstrates that Taxonomer is not restricted to short reads, allowing re-analysis of contigs for still greater classification sensitivity. The 239 reads used to generate the Anellovirus contigs were analyzed with Afterburner in combination with Protonomer. Consistent with the benchmark data presented in FIG. 9, Protonomer classified 19 of the 239 reads as derived from Anellovirus, whereas Protonomer in combination with Afterburner identified 89 of the 239 reads as derived from Anellovirus. Protonomer did not misclassify any Anellovirus-derived reads, whereas Afterburner misclassified 110 of the Anellovirus-derived reads to other viral taxa.

For the benchmark data in FIG. 9, RNA-Seq data from 23 samples known to harbor respiratory viruses (Table 14; human coronavirus, n=3; Coxsackievirus, n=1; human bocavirus, n=1; human metapneumovirus, n=2; rhinovirus, n=10; parainfluenza virus, n=2; and respiratory syncytial virus, n=4) were binned and the ‘viral’ and ‘unclassified’ bins were taxonomically classified by Protonomer, Afterburner, and Protonomer followed by Afterburner analysis of previously unclassified reads (samples as in FIGS. 4A-H, see FIGS. 14A-D and Table 6). FIG. 9A shows that Protonomer (94.6±2.7%) and Afterburner (94.5±2.3%) had comparable sensitivity while their combination was slightly more sensitive (95.0±2.4%). FIG. 9B shows, conversely, that Protonomer (91.1±16.8%) was slightly more specific than Afterburner (86.6±21.4%) and a combination of the two tools (86.8±20.7%). True viral reads were determined by mapping of all reads to a manually constructed viral consensus genome sequence for each sample. FIG. 9C shows that the mean analysis times were 14.3±7.5 seconds (Protonomer), 27.4±21.5 seconds (Protonomer/Afterburner), and 41.7±28.7 seconds (Afterburner). All tools were run on 16 CPUs.

Taxonomer detected highly similar proportions of viral (influenza A from a nasopharyngeal swab) and bacterial (Mycoplasma pneumoniae from a bronchoalveolar lavage) pathogens in respiratory tract samples subjected to 2 different library preparation methods and 3 different next-generation sequencing platforms (see FIG. 5D and FIG. 20). While the same sequencing libraries were analyzed with the MiSeq and HiSeq instruments, separate sequencing libraries were prepared for the Ion Proton instrument. Similar proportions of viral (0.43% to 0.55% of all reads) and bacterial (16S rRNA sequences representing 0.004% to 0.006% of all reads) pathogen sequences were obtained with all experimental conditions. The presence of both pathogens was confirmed by qPCR. With each of the three platforms, >99% of viral reads identified by Taxonomer were classified as influenza A virus. The proportions of bacterial 16S reads identified as Mycoplasma pneumoniae were more variable (MiSeq 69.3%, HiSeq 65.9%, Ion Proton 30.5%). These results demonstrate the versatility of Taxonomer and how it can be used with a variety of sequencing instruments to detect previously missed pathogens and for quality control of expression profiling studies.

Taxonomer was used to detect contamination in a cell culture. RNA-seq data was analyzed from induced pluripotent stem cell cultures with and without Mycoplasma contamination. Quality control of the RNA-Seq data by Taxonomer immediately highlighted bacterial contamination (pie chart) and identified the organism as M. yeatsii (99.4% sequence identity with the type strain, MYU67946). High expression of rRNA was demonstrated by 32% of RNA-Seq reads mapping to the M. yeatsii 16S rRNA gene (245,000× coverage).

Example 10: Taxonomic Classification in Education

Teachers can design genomics-related curricula around the taxonomic methods and systems described herein, such as Taxonomer as described in Example 1, to allow students to design experiments, collect samples, and analyze them with Taxonomer. Students collect soil samples, extract DNA/RNA from the soil samples, perform Next Generation Sequencing, use Taxonomer to analyze taxonomic composition, and then compare samples collected from different locations.

Example 11: Taxonomic Classification for Consumers

A consumer can collect a sample, such as a swab from the mouth, skin, or kitchen sink, seal the sample in a zip bag, mail the sample to a sequencing laboratory, and then analyze the sequencing result online using taxonomic methods and systems described herein, such as Taxonomer as described in Example 1. As a non-limiting example, a dentist can obtain a sample using a mouth, tooth, or gum swab to test for mouth or tooth microbes.

Example 12: Taxonomic Classification for Food Safety and Authenticity

Food safety inspectors, food manufacturers, vendors and consumers can check food for contamination by examining microbial content in food, or food authenticity by examining whether food ingredients match the label, using taxonomic methods and systems described herein, such as Taxonomer as described in Example 1. As a non-limiting example, a swab from a food surface or a small piece of the food can be tested.

Example 14: Taxonomic Classification for Hospital Safety and Contamination Monitoring

Hospitals and health officials can monitor microbial contamination in hospital equipment, rooms, and patient belongings using taxonomic methods and systems described herein, such as Taxonomer as described in Example 1. As a non-limiting example, a swab from equipment, belongings, or a wall or floor surface can be tested for microbial contaminants. As a non-limiting example, such microbial contaminants can be multiple-drug resistant strains of microbes.

Example 15: Taxonomic Classification for Biological Product Quality and Safety Monitoring

Inspectors and consumers can monitor microbial contamination in biological products and in biological product manufacturing processes using taxonomic methods and systems described herein, such as Taxonomer as described in Example 1. As a non-limiting example, biological products can be tested for microbial contamination. In another non-limiting example, cell lines or other material used for biological product manufacturing processes can be tested for host gene expression profiling, quality monitoring, and microbial contamination.

Example 16: Taxonomic Classification for Animal Disease Diagnostics and Treatment

A person involved in animal disease management, such as a veterinary practitioner, a farmer, or a pet owner, can diagnose or treat an animal using taxonomic methods and systems described herein, such as Taxonomer as described in Example 1. As non-limiting examples, a mouth swab, blood, a nasopharyngeal swab, urine, stool, or a swab from a wound site can be collected, sequenced, and analyzed using Taxonomer. The results of the analysis can be used by a veterinary medicine practitioner for diagnostics and treatment plan development.

Example 17: Taxonomic Classification for Microbial Strain Profiling

The taxonomic methods and systems described herein, such as Taxonomer as described in Example 1, can be used to profile microbial strains. A Taxonomer database can be constructed containing microbial strain information (e.g. a bacterial database constructed from different strains, including multiple-drug resistant strains). For example, whole-genome DNA sequences or sequencing reads from multiple strains of one bacterial species can be used for database construction. In another example, sequences of strains can be from a virus, such as HIV, HCV, HBV, and influenza. For such applications, one could use a k-mer subtraction method to identify and retain k-mers that are uniquely diagnostic for a particular node or leaf in the classification database; this approach may be used to remove k-mers that are common to multiple nodes or leaves and that frustrate diagnostic efforts. For example, one could specifically produce an antibiotic resistance or virulence factor classification database that allows the unique identification of reads arising from particular resistance markers or virulence factors.
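The k-mer subtraction idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Taxonomer implementation: the function names `kmers` and `unique_kmers_per_node`, the toy strain sequences, and the choice of k = 4 are all hypothetical and chosen only to keep the example small.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers_per_node(node_seqs, k):
    """For each node (or leaf), keep only k-mers found in no other node.

    node_seqs maps a node name to a list of reference sequences
    (hypothetical layout, assumed for this sketch).
    """
    per_node = {
        node: set().union(*(kmers(s, k) for s in seqs))
        for node, seqs in node_seqs.items()
    }
    # Count in how many nodes each k-mer occurs.
    node_count = Counter(km for kms in per_node.values() for km in kms)
    # Subtract every k-mer that is shared by two or more nodes.
    return {
        node: {km for km in kms if node_count[km] == 1}
        for node, kms in per_node.items()
    }


# Toy database with two hypothetical strains that share the k-mer "ACGT".
db = {"strain_A": ["ACGTACGA"], "strain_B": ["ACGTTTGA"]}
unique = unique_kmers_per_node(db, 4)
```

In this toy database the shared k-mer is dropped from both strains, so every retained k-mer points to exactly one strain.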

In one embodiment, detecting a microbial strain is achieved by calculating the probability that a certain microorganism was in the sample given the probability that one or more of its reference sequences (e.g. 16S, CDSs, etc.) were observed.

First, we can compute the number of times a kmer (K_(i)) is expected to be seen in a given reference sequence tagged by a read due to errors as shown in Eq. 2:

$\begin{matrix}{\left| K_{i_{obs\_err}} \right| = \left( E_{base} \times L_{K} \times \left| NBRS \right| \right) \times \left| reads_{K_{nbrs}} \right|} & {{Eq}.2}\end{matrix}$

Here ‘|NBRS|’ denotes the number of kmers in the database of reference sequences that differ by one or more nucleotides from K_(i). L_(K) is the kmer length. E_(base) is the per-base error rate of the sequencing platform; and |reads_(K) _(nbrs)| is the number of reads that contain those neighboring kmers. |K_(i) _(obs_err)| is the number of times K_(i) would be expected to be observed due to sequencing errors.
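As a worked illustration of Eq. 2, the short sketch below evaluates the expected number of error-derived observations of a k-mer. The function name and all numeric inputs (a 1% per-base error rate, 21-mers, 3 neighboring k-mers, 200 reads containing those neighbors) are hypothetical values assumed for the example; they are not taken from the experiments described herein.

```python
def expected_error_observations(e_base, kmer_len, n_neighbors, n_neighbor_reads):
    """Eq. 2: (E_base x L_K x |NBRS|) x |reads_K_nbrs|."""
    return (e_base * kmer_len * n_neighbors) * n_neighbor_reads


# Hypothetical inputs, assumed only for illustration.
expected = expected_error_observations(
    e_base=0.01,          # per-base error rate of the sequencing platform
    kmer_len=21,          # k-mer length L_K
    n_neighbors=3,        # |NBRS|: neighboring k-mers in the database
    n_neighbor_reads=200  # reads containing those neighboring k-mers
)
print(f"expected error-derived observations: {expected:.1f}")
```

With these inputs the expected count is about 126, so under Eq. 3 a k-mer would need to appear in more reads than that before its observation is attributed to true presence in the sample.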

Then we can calculate the probability that K_(i) was actually observed because it was actually in the sample as shown in Eq. 3:

$\begin{matrix}{pK_{i_{obs}} = 1 - \left( P_{{|reads|}_{K_{i}}} \right)\ {for}\ \left| reads_{K_{i}} \right| > \left| K_{i_{obs\_err}} \right|,\ {else}\ P_{{|reads|}_{K_{i}}} = 0} & {{Eq}.3}\end{matrix}$

where 1 − (P_(|reads|_(K_(i)))) is the Gaussian expectation of observing |reads| containing K_(i) solely due to errors.

Then we can calculate the probability that 1 or more of those reads containing K_(i) originated from reference Seq_(j) as shown in Eq. 4.

$\begin{matrix}{pR_{seq_{j}} \mid K_{i} = \left( 1 - \left( \frac{1}{\left| Seqs_{K_{i}} \right|} \right)^{|r|} \right)} & {{Eq}.4}\end{matrix}$

where |Seqs_(K_(i))| is the number of reference sequences in the database that contain K_(i), and |r| is the number of reads containing K_(i). In other words, every reference sequence having K_(i) is equally likely to have given rise to a read containing K_(i).

The likelihood that reference Seq_(j) was observed given the probability that each of its k-mers (K_(i)) was observed in the sequence reads, is a recursive conditional probability, as shown in Eq. 5.

$\begin{matrix}{{For}\ {all}\ K_{i}\ {in}\ Seq_{j}:\quad pK_{i_{seq_{j}}} = \left( pK_{i_{obs}} \times \left\lbrack pR_{seq_{j}} \mid K_{i} \right\rbrack \right) \times pK_{{i - 1}_{seq_{j}}}} & {{Eq}.5}\end{matrix}$

The final value of the recursion gives a conditional probability that Seq_(j) was observed based upon the probability that each of its kmers was observed one or more times in the read dataset: $pSeq_{j} \mid \prod_{n}^{i} pK_{i_{seq_{j}}}$

In practice, this formulation can be extended to a collection of ORFs, or other reference sequences, that may comprise a bacterial stereotype, a viral genome, or a bacterial genome, or specific microorganism reference sequences, as shown in Eq. 6.

For all Seq_(j) ∈ collection:

$\begin{matrix}{pCollection = \prod_{c}^{j}\left( pSeq_{j} \mid \prod_{n}^{i} pK_{i_{seq_{j}}} \right)} & {{Eq}.6}\end{matrix}$
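The recursion of Eq. 5 and the product of Eq. 6 can be sketched as below. This is an illustrative sketch only: the per-k-mer values pK_obs (Eq. 3) and pR given K_(i) (Eq. 4) are taken as precomputed inputs, the starting value of 1.0 for the recursion is an assumption (the text does not state it), and the function names and toy probabilities are hypothetical.

```python
from math import prod


def p_sequence(kmer_probs):
    """Eq. 5: running product over the k-mers of one reference sequence.

    kmer_probs is a list of (pK_obs, pR_given_K) pairs, one pair per k-mer,
    assumed to have been computed beforehand with Eq. 3 and Eq. 4.
    """
    p = 1.0  # assumed starting value of the recursion
    for p_obs, p_ref in kmer_probs:
        p = (p_obs * p_ref) * p
    return p


def p_collection(collection):
    """Eq. 6: product of the per-sequence values over a collection."""
    return prod(p_sequence(kmer_probs) for kmer_probs in collection)


# Two hypothetical reference sequences (e.g. two loci of one organism).
locus_a = [(0.9, 0.5), (0.8, 0.5)]
locus_b = [(1.0, 0.5)]
p_a = p_sequence(locus_a)
p_all = p_collection([locus_a, locus_b])
```

Because products of many probabilities shrink quickly, a practical implementation would likely accumulate sums of logarithms instead; that refinement is omitted here for clarity.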

In one example, we applied this approach using sequence information from 7 genetic loci in the S. pneumoniae genome. Paired-end Illumina sequence reads of 50 (FIG. 21A), 100 (FIG. 21B) and 125 bases (FIG. 21C) in length were simulated using the reference genome with the native Multilocus Sequence Typing (MLST) loci replaced with randomly chosen, previously observed MLST alleles for each of the 7 loci. This process was repeated for 100 simulated whole genomes under 6 different average levels of genome-wide coverage (1, 2, 5, 10, 25 and 50×). MLST alleles were determined using three methods: (1) de novo assembly (“assembly”) in which reads were assembled using the Velvet assembler after which the best allelic match was determined using BLAST to a database of known MLST alleles; (2) consensus read mapping (“consensus”) in which reads were mapped to the R6 reference genome using bwa followed by majority rules consensus base calling at the MLST loci and allelic assignment; and (3) kmer-based MLST typing (“MLST” in FIGS. 21A-C) which catalogues the k-mer content of a read set, determines the probability of their observation and uses a composite likelihood framework to assign the best allele call and composite MLST genotype. Boxplots in FIGS. 21A-C show the number of correctly identified loci using each of the three sequence typing methods and under the 6 different simulated coverage scenarios. The results are stratified by simulated read length with (FIG. 21A) 50 bp reads, (FIG. 21B) 100 bp reads and (FIG. 21C) 125 bp reads.

We also show the fraction of correctly MLST genotyped simulated S. pneumoniae strains under different coverage scenarios in FIGS. 22A-C. Simulated Illumina reads were generated as in FIGS. 21A-C for 100 S. pneumoniae strains. The sequence reads were processed using (1) a k-mer MLST typing pipeline (“K-mer”, black), (2) de novo assembly (“Assembly”, blue), and (3) read mapping and consensus calling (“Consensus”, green) as detailed in FIGS. 21A-C. After determining the allelic composition of each locus using either the composite likelihood approach (1) or finding the BLAST based highest scoring pair (HSP; 2&3), we determined the fraction of correctly identified MLST genotypes under 6 different simulated coverage scenarios from paired-end reads of 50 (FIG. 22A), 100 (FIG. 22B) and 125 (FIG. 22C) bases in length.

In another embodiment, one can determine what organisms are present in a set of query sequences, e.g. sequencing reads from Next Generation Sequencing, given a database of reference sequences of known organisms by summarizing how well an organism's reference sequences are represented by the query sequences into a single score, or rank metric.

This may be achieved in two steps. First, one can k-merize the query sequences, and place the k-mers over the reference sequences at the matching locations. Second, for a single organism the dot product is computed between the relative uniqueness of a k-mer location in the reference sequence and the binarized k-mer coverage (where binarized coverage=1 if k-mer coverage >0 and 0 otherwise). K-mer uniqueness is calculated as the count of a particular k-mer, ki, in a specific organism divided by the count of ki in the entire database. For example, if ki is found 3 times in a particular organism and ki is counted 10 times in the database, the uniqueness of ki in the organism would be 3/10 or 0.3. As an example, suppose a reference sequence contains three k-mers: k1, k2, and k3. These three k-mers have a relative uniqueness of u1, u2, and u3. These k-mers have binarized coverage of bc1, bc2, and bc3. The dot product would then be computed as (u1*bc1+u2*bc2+u3*bc3). Next one can calculate the proportion of bases for a single organism's reference sequences that have nonzero coverage compared to the total number of bases in the organism's reference sequences; call this term pi. This information can be summarized using a weighted sum into a single number called the rank metric. Given the weights w1 and w2, we can calculate the rank metric as w1*(u1*bc1+u2*bc2+u3*bc3)+w2*pi. The rank metric is a condensed summary of how well an organism's reference sequences are represented by the query sequences. Each weight is a number between 0 and 1, and the sum of all weights, in this example w1+w2, is 1. In practice, one can use simulation and machine learning methods, e.g. random forests, to compute optimal weights with training data sets or on extensive simulations, and discover rank metric cutoffs that allow making informed calls about which organisms' DNA and/or RNA is present or absent in a given set of query sequences.
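The two-step rank metric described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than a definitive implementation: it handles a single reference sequence per organism, takes the k-mer uniqueness values as a precomputed dictionary, and uses equal weights w1 = w2 = 0.5; the function names, the toy sequences, and the uniqueness values are hypothetical.

```python
def kmerize(seq, k):
    """Return the list of k-mers of seq, in positional order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def rank_metric(reference, query_kmers, uniqueness, k, w1=0.5, w2=0.5):
    """Weighted sum of the uniqueness/coverage dot product and the term pi.

    reference   : one organism's reference sequence (simplifying assumption)
    query_kmers : set of k-mers taken from the query sequences
    uniqueness  : k-mer -> count in organism / count in whole database
    """
    covered = [False] * len(reference)
    dot = 0.0
    for pos, km in enumerate(kmerize(reference, k)):
        bc = 1 if km in query_kmers else 0   # binarized k-mer coverage
        dot += uniqueness.get(km, 0.0) * bc  # u_i * bc_i
        if bc:
            for j in range(pos, pos + k):    # mark bases under this k-mer
                covered[j] = True
    pi = sum(covered) / len(reference)       # fraction of covered bases
    return w1 * dot + w2 * pi


# Toy data: a 6-base reference and one 4-base query read, with k = 3.
reference = "ACGTAC"
query_kmers = set(kmerize("ACGT", 3))
uniqueness = {"ACG": 0.3, "CGT": 1.0, "GTA": 0.5, "TAC": 0.5}
score = rank_metric(reference, query_kmers, uniqueness, k=3)
```

Here the read covers the k-mers ACG and CGT, so the dot product is 0.3 + 1.0 and four of the six reference bases have nonzero coverage. The sketch leaves the dot product unnormalized, as in the text, so the score is not bounded by 1.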

In one embodiment, the cutoff for positively identifying an organism is a fixed value. In another embodiment, the cutoff for positively identifying an organism varies depending on the rank metric of other organisms represented in the query sequences. Because of homology or sequence similarity between different organisms, a k-mer from the sequence of one organism can match sequences of other organisms; therefore, a set of k-mers from one organism can generate different rank metric values for a set of different organisms. The cutoff can be defined to be greater than the rank metric values of the set of predefined different organisms.

Example 18: Taxonomic Classification for Tumor Profiling

The taxonomic methods and systems described herein, such as Taxonomer as described in Example 1, can be used to profile tumor-derived DNA. DNA sequences of different tumors from different tissues can be used to construct a tumor database, which can then be used in Taxonomer for analyzing sequences obtained from tumor tissues. Taxonomer can assign each read to a most likely tumor type using a tumor database so constructed. As non-limiting examples, a whole genome assembly or genomic sequence reads can be used for database construction.

Example 19: Taxonomic Classification for Forensic Profiling

The taxonomic methods and systems described herein, such as Taxonomer as described in Example 1, can be used to assign sequencing reads to an individual of a population if a database is constructed from genomic sequences of individuals from the population. The population used to construct a database can be, in some examples, people with criminal records, aliens who enter the United States, or the people living in a country. DNA material from a crime scene can be sequenced and analyzed with Taxonomer as described above to determine if a DNA sample derived from an individual is present.

Example 20: Taxonomic Classification for Genetic Testing

The taxonomic methods and systems described herein, such as Taxonomer as described in Example 1, can be used with an artificial k-mer database containing all simulated k-mers with DNA polymorphisms associated with diseases or conditions. Taxonomer can then be used to assign sequencing reads from an individual to particular disease-causing genotypes. The DNA being analyzed with Taxonomer can, as an example, be fetal-derived cell-free DNA from a pregnant woman for prenatal screening, or the DNA can be derived from a mouth swab, saliva, or blood from an individual undergoing genetic testing.
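One way to picture the artificial k-mer database described above is to enumerate every k-mer that overlaps a polymorphic position once the variant allele is substituted in. The sketch below does this for a single-nucleotide substitution only; the function name `variant_kmers`, the toy reference sequence, the position, and k = 4 are hypothetical and are not drawn from any real disease locus.

```python
def variant_kmers(ref_seq, pos, alt, k):
    """Return all k-mers that cover 0-based position pos after substituting alt.

    Only single-nucleotide substitutions are handled in this sketch.
    """
    mutated = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    start = max(0, pos - k + 1)        # first window that still covers pos
    end = min(pos, len(mutated) - k)   # last window that covers pos
    return {mutated[i:i + k] for i in range(start, end + 1)}


# Toy example: substitute T with G at position 3 of a hypothetical reference.
ref = "ACGTACGT"
alt_kmers = variant_kmers(ref, pos=3, alt="G", k=4)
ref_kmers = {ref[i:i + 4] for i in range(len(ref) - 4 + 1)}
```

In this toy case none of the variant k-mers occurs in the reference, so a read containing any of them can be assigned to the variant genotype rather than the reference allele.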

Example 21: Use of k-Mer Weights or Other Measure to Calculate a Pairwise Distance Between Every Sequence in a Classification Database

The taxonomic methods and systems described herein, such as Taxonomer as described in Example 1, can be used to calculate a pairwise distance between every sequence in a classification database. Such a pairwise distance can be used to identify sequences with discordant neighbors, thus identifying misannotated or mislocated sequences within a previously existing taxonomy or classification database. The pairwise distances can be used to produce a new phylogenetic tree, having the optimal structure for accurate classification and diagnosis. Bootstrap or other node-confidence metrics can be used to collapse polytomies and poorly resolved nodes in a taxonomic tree, for reasons such as speeding classification, improving classification, or improving diagnostic accuracy. The abovementioned database can be used to classify reference sequences previously annotated as derived from a common clinical strain, isolate, or otherwise named organism as used in diagnostic parlance. This name can be associated with appropriate leaves and nodes of the database in order to associate the taxonomic associations in the database with commonly used diagnostic names for organisms. A similar process can be used to produce a protein database organized by sequence similarity such that the different branches correspond to different types of genes or proteins, such as different functions, GO classifications, gene-families, etc. As described for the nucleotide taxonomic database, the name of the organism from which the protein(s) are derived can be attached to the appropriate leaf and node in the protein taxonomy. This approach can be used to distinguish between closely related pathogens, such as E. coli and Shigella or Anthrax and other Bacilli that cannot be distinguished using 16S sequence alone. For example, other proteins and nucleotide sequences can be used to confirm the presence of a particular pathogen, wherein the presence is indicated by a first piece of data, such as the 16S sequence, and reads are also present that are classified as a taxon-specific protein. These confirmatory findings can be reported to improve a diagnosis. For example, viruses can be classified using the above process to produce a protein database organized upon sequence similarity or pairwise distance such that different branches would correspond to different types of viral genes. Non-limiting examples of these genes can be gag, env, or pol. The names of the viruses from which these sequences are derived can be attached to the appropriate leaves and nodes of the taxonomic structure. The viral database can be used to establish what portion of a viral or viral taxon's genome is present in the query dataset. For example, one could test that HIV was detected in a sample and specifically that the sample contained HIV gag and pol sequences, but not HIV env sequences.
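A pairwise distance over a classification database, and its use to flag sequences with discordant neighbors, can be sketched as follows. The text leaves the distance measure open (k-mer weights or another measure); this sketch substitutes a plain Jaccard distance on k-mer sets as a stand-in, which is an assumption and not the k-mer weight described herein. The function names, toy sequences, and taxon labels are likewise hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; stand-in for a k-mer-weight-based distance."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def pairwise_distances(seqs, k):
    """Distance for every unordered pair of named sequences."""
    sets = {name: kmer_set(s, k) for name, s in seqs.items()}
    return {
        (x, y): jaccard_distance(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }


def discordant(seqs, labels, k):
    """Names whose nearest neighbor carries a different taxon label."""
    dist = pairwise_distances(seqs, k)
    flagged = []
    for name in seqs:
        others = [n for n in seqs if n != name]
        nearest = min(others, key=lambda n: dist[tuple(sorted((name, n)))])
        if labels[nearest] != labels[name]:
            flagged.append(name)
    return flagged


# Toy database: s4 resembles s3 but is annotated with the taxon of s1 and s2.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG", "s4": "TTTTGGGC"}
labels = {"s1": "taxon_A", "s2": "taxon_A", "s3": "taxon_B", "s4": "taxon_A"}
suspects = discordant(seqs, labels, k=3)
```

Both members of the mismatched pair are flagged, since each has a nearest neighbor with a different label; deciding which annotation is wrong would need further evidence, such as the labels of additional neighbors.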

Example 22: Effect of Joint Analysis of Mate Pairs on Read Binning

In this example, the effect of joint analysis of mate pairs on read binning using Taxonomer as described in Example 1 is described. Sample 2 from FIG. 18 was analyzed with the ‘Binner’ module either using only read 1, only read 2, or analyzing both reads jointly after concatenation. Concatenation of mate pairs results in fewer reads with unknown (−13%) and ambiguous bin assignment (−19%) compared to results based on read 1 alone. The largest relative change is seen for phages (+58%), bacteria (+19%), fungi (+18%), and ‘Other’ (+17%), see Table 18.

TABLE 18 Effect of joint analysis of mate pairs on read binning.

            Read 1              Read 2              Concatenated
Bin         Reads       %       Reads       %       Reads       %       Change
Human       1,907,759   32.2    1,902,495   32.1    1,944,049   32.8    +2%
Bacterial   952,402     16.1    953,957     16.1    1,134,751   19.1    +19%
Fungal      274,232     4.6     274,800     4.6     324,874     5.5     +18%
Viral       4,792       0.1     4,799       0.1     5,470       0.1     +14%
Phage       1,292       0.0     1,434       0.0     2,041       0.0     +58%
Ambiguous   840,208     14.2    841,933     14.2    681,413     11.5    −19%
Other       449,774     7.6     452,366     7.6     528,242     8.9     +17%
Unknown     1,498,009   25.3    1,496,761   25.2    1,308,191   22.1    −13%
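The Change column of Table 18 is the relative change in read count between the concatenated analysis and read 1 alone. The short sketch below recomputes it from the read counts in the table; the variable and function names are illustrative only.

```python
# Read counts copied from Table 18 (read 1 alone vs. concatenated mate pairs).
read1 = {
    "Human": 1_907_759, "Bacterial": 952_402, "Fungal": 274_232,
    "Viral": 4_792, "Phage": 1_292, "Ambiguous": 840_208,
    "Other": 449_774, "Unknown": 1_498_009,
}
concatenated = {
    "Human": 1_944_049, "Bacterial": 1_134_751, "Fungal": 324_874,
    "Viral": 5_470, "Phage": 2_041, "Ambiguous": 681_413,
    "Other": 528_242, "Unknown": 1_308_191,
}


def relative_change(before, after):
    """Percent change from before to after, rounded to a whole percent."""
    return round(100 * (after - before) / before)


changes = {b: relative_change(read1[b], concatenated[b]) for b in read1}
for bin_name, pct in changes.items():
    print(f"{bin_name:<10} {pct:+d}%")
```

The recomputed values agree with the Change column, including the reductions in the Ambiguous and Unknown bins.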

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

1.-46. (canceled)
47. A method of detecting a plurality of taxa in a sample, the method comprising providing sequencing reads for a plurality of polynucleotides from the sample, and for each sequencing read: (a) assigning the sequencing read to a first taxonomic group based on a first sequence comparison between the sequencing read and a first plurality of polynucleotide sequences from the different first taxonomic groups, wherein at least two sequencing reads are assigned to different taxonomic groups; (b) performing with a computer system a second sequence comparison between the sequencing read and a second plurality of polynucleotide sequences corresponding to members of the first taxonomic group, wherein the comparison comprises counting a number of k-mers within the sequencing read of at least 5 nucleotides in length that exactly match one or more k-mers within a reference sequence in the second plurality of polynucleotide sequences; (c) classifying the sequencing read as belonging to a second taxonomic group that is more specific than the first taxonomic group if a measure of similarity between the sequencing read and reference sequence is above a first threshold level; (d) if no similarity above the first threshold level is identified in (c), classifying the sequencing read as belonging to the second taxonomic group based on similarity above a second threshold level determined by comparing with the computer system a sequence derived from translating the sequencing read and a third set of reference sequences corresponding to amino acid sequences of members of the first taxonomic group; and (e) identifying the presence, absence, or abundance of the plurality of taxa in the sample based on the classifying of the sequencing reads.
48. The method of claim 47, wherein step (b) further comprises calculating k-mer weights as measures of how likely it is that k-mers within the sequencing read are derived from a reference sequence in the second plurality of polynucleotide sequences.
49. The method of claim 47, wherein the third set of reference sequences consist of polynucleotide sequences derived from reverse-translating the corresponding amino acid sequences.
50. The method of claim 47, further comprising performing with the computer system a relaxed sequence comparison between the sequencing read and the second plurality of polynucleotide sequences if the similarity in (d) is below the second threshold, wherein the relaxed sequence comparison is less stringent than the second sequence comparison.
51. The method of claim 47, wherein classifying the sequencing read in step (c) comprises resolving a tie between two or more possible taxonomic groups based on a k-mer weight as a measure of how likely it is that the sequencing read corresponds to a polynucleotide from an ancestor of one of the possible taxonomic groups.
52. The method of claim 47, further comprising diagnosing a condition based on a degree of similarity between the plurality of taxa detected in the sample and a biological signature for the condition.
53. The method of claim 52, wherein the condition is contamination of the sample.
54. The method of claim 52, wherein the condition is an infection of a subject.
55. The method of claim 54, wherein infection is assessed based on the presence or amount of (i) sequences of host transcripts; and/or (ii) sequences of one or more infectious agents.
56. The method of claim 54, further comprising monitoring treatment in an infected subject by detecting presence, absence, or abundance of a plurality of taxa in samples from the infected subject at multiple times after beginning treatment.
57. The method of claim 56, further comprising changing treatment of the infected subject based on results of the monitoring.
58. The method of claim 47, wherein step (c) further comprises classifying the sequencing read as corresponding to a gene transcript if the measure of similarity between the sequencing read and reference sequence is above the first threshold level.
59. The method of claim 58, further comprising diagnosing a condition based on a degree of similarity between the plurality of taxa detected in the sample and a biological signature for the condition.
60. The method of claim 47, wherein step (a) comprises assigning sequencing reads to two or more taxa selected from bacteria, viruses, fungi, or humans.
61. The method of claim 47, wherein a sequencing read classified as belonging to the second taxonomic group and not present among the group of sequences corresponding to the second taxonomic group is added to the group of sequences corresponding to the second taxonomic group for use in later sequence comparisons.
62. The method of claim 47, wherein the second plurality of nucleotide sequences comprises marker gene sequences for taxonomic classification of bacterial sequences.
63. The method of claim 62, wherein the marker gene sequences comprise 16S rRNA sequences.
64. The method of claim 47, wherein the second plurality of nucleotide sequences comprises sequences of human transcripts.
65.-69. (canceled)